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iterations  10,000  times.  In  each  case,  the  starting  value  was  X0  = 1. 

This  is  a histogram  of  the  10,000  iid  copies  of  Xi$.  The  solid  line  is 
(an  appropriately  scaled  version)  of  the  stationary  density 20 

2.2  The  independence  Metropolis  algorithm  with  9 = 4 was  run  for  15 

iterations  10,000  times.  In  each  case,  the  starting  value  was  X0  = 1. 

This  is  a histogram  of  the  10,000  iid  copies  of  Xi$.  The  solid  line  is 
(an  appropriately  scaled  version)  of  the  stationary  density 21 

2.3  The  independence  Metropolis  algorithm  with  9 = 4 was  run  for  1,000 

iterations  10,000  times.  In  each  case,  the  starting  value  was  X0  = 1. 

This  is  a histogram  of  the  10,000  iid  copies  of  Xiooo-  The  solid  line 
is  (an  appropriately  scaled  version)  of  the  stationary  density 22 

2.4  Consider  the  family  of  densities  IG  f^yT,  y + y(/i  — y)2)  as  /r  ranges 

over  the  set  that  is,  as  (/i-y)2  ranges  between  0 and  d.  Suppose, 
for  example,  that  m = 5,  s2  = 10,  and  d = 22/5.  Then  the  shape 
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Five  of  these  densities,  including  the  extremes,  are  pictured  above. 

The  point  of  intersection  of  the  two  extremes  is  9*  = 4.73.  It  is  clear 
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9*,  IG(2,5)  is  the  smallest 38 
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Two  important  questions  that  must  be  answered  whenever  a Markov  chain  Monte 
Carlo  algorithm  is  used  are  (Ql)  What  is  an  appropriate  burn-in ? and  (Q2)  How 
long  should  the  sampling  continue  after  burn-in?  Developing  rigorous  answers  to 
these  questions  presently  requires  a detailed  study  of  the  convergence  properties  of 
the  underlying  Markov  chain.  Consequently,  in  most  practical  applications  of  MCMC, 
exact  answers  to  (Ql)  and  (Q2)  are  not  sought.  One  goal  of  this  dissertation  is  to 
demystify  the  analysis  that  leads  to  rigorous  answers  to  (Ql)  and  (Q2). 

The  ability  to  formally  address  (Ql)  and  (Q2)  comes  from  establishing  a drift 
condition  and  an  associated  minorization  condition,  which  together  imply  that  the 
underlying  Markov  chain  is  geometrically  ergodic.  We  explain  exactly  what  drift  and 
minorization  are  as  well  as  how  and  why  these  conditions  can  be  used  to  form  rigorous 
answers  to  (Ql)  and  (Q2).  The  basic  ideas  are  as  follows.  Once  we  have  established 
the  appropriate  drift  and  minorization  conditions  we  can  employ  existing  results  to 


IX 


construct  a formula  giving  an  exact  upper  bound  on  the  distance  to  stationarity.  A 
definitive  answer  to  (Ql)  can  be  calculated  using  this  formula.  The  desired  char- 
acteristics of  the  target  distribution  are  typically  estimated  using  ergodic  averages. 
Geometric  ergodicity  of  the  underlying  Markov  chain  implies  that  there  are  central 
limit  theorems  available  for  ergodic  averages.  The  regenerative  simulation  technique 
can  be  used  to  get  a consistent  estimate  of  the  variance  of  the  asymptotic  Normal 
distribution.  Hence,  an  asymptotic  standard  error  can  be  calculated,  which  provides 
an  answer  to  (Q2)  in  the  sense  that  an  appropriate  time  to  stop  sampling  can  be 
determined. 
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CHAPTER  1 

A GENTLE  INTRODUCTION  TO  MARKOV 
CHAIN  MONTE  CARLO 

1.1  Introduction 

Realistic  statistical  models  often  give  rise  to  probability  distributions  that  are 
computationally  difficult  to  use  for  inference.  Fortunately,  there  is  a collection  of 
algorithms,  known  as  Markov  chain  Monte  Carlo  (MCMC),  that  has  brought  many 
of  these  models  within  computational  reach.  Over  the  last  ten  years,  a staggering 
amount  of  research  has  been  done  on  both  theoretical  and  applied  aspects  of  MCMC. 
Thus  this  chapter  is  not  intended  to  be  a complete  overview  of  MCMC.  However,  we 
do  hope  to  get  the  reader  started  in  the  right  direction.  To  this  end,  we  begin  with  a 
general  description  of  the  types  of  problems  that  necessitate  the  use  of  MCMC.  Then 
the  fundamental  algorithms  are  introduced  and  some  general  implementation  issues 
addressed.  Discussion  of  the  Markov  chain  theory  underpinning  MCMC  algorithms 
has  been  intentionally  avoided  here  since  it  will  be  considered  in  some  detail  in  the 
next  chapter. 

MCMC  is  a simulation  technique  that  allows  one  to  make  (approximate)  draws 
from  complex,  high  dimensional  probability  distributions.  While  the  Bayesians  have 
certainly  made  the  most  use  of  MCMC,  applications  have  popped  up  in  many  dif- 
ferent areas  of  statistics.  For  example,  MCMC  techniques  can  be  used  to  calculate 
P-values  in  exact  conditional  inference  (Diaconis  and  Sturmfels  1998)  and  to  maxi- 
mize intractable  likelihood  functions  associated  with  generalized  linear  mixed  models 
(McCulloch  1997,  McCulloch  and  Searle  2001).  Gilks,  Richardson  and  Spiegelhalter 
(1996)  provide  a wonderful  introduction  to  applied  MCMC.  Those  interested  in  a 
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more  theoretical  introduction  should  consult  Besag,  Green,  Higdon  and  Mengersen 
(1995),  Robert  and  Casella  (1999)  and  Roberts  and  Rosenthal  (1998a). 

Specific  applications  where  MCMC  is  useful  include  spatial  statistics  (Besag  and 
Green  1993),  image  analysis  (Besag  1993),  remote  sensing  (Green  and  Strawderman 
1994),  capture-recapture  studies  (Feinberg,  Johnson  and  Junker  1999),  agricultural 
field  experiments  (Besag  et  al.  1995),  animal  breeding  (Wang,  Rutledge  and  Gianola 
1993),  and  genetics  (Wilson  and  Balding  1998).  Of  course,  this  list  is  not  meant 
to  be  exhaustive.  Indeed,  as  one  attempts  to  accurately  model  natural  phenomena, 
complex,  high  dimensional  probability  distributions  become  the  rule  rather  than  the 
exception.  We  now  give  an  example  that  is  motivated  by  the  salamander  data  set 
introduced  by  McCullagh  and  Nelder  (1989,  Section  14.5). 

Example  1.1.1  Suppose  there  are  I female  and  J male  salamanders  of  a certain 
species,  and  the  experiment  consists  of  coupling  each  male  with  each  female.  The 
data  consist  of  TV  = IJ  binary  observations,  yij , indicating  the  success  or  failure  of 
the  mating  episode  of  the  zth  female  with  the  jth  male.  Let  Uj  denote  the  (random) 
effect  of  the  ith  female  and  Vj  the  (random)  effect  of  the  jth  male  and  assume  that 
conditional  on  u = («i, . . . , uj)T  and  v = (vi, . . . , vj)T  the  s are  independent  with 
yij\ui,Vj  ~ Bernoulli(7Tjj)  where  is  the  probability  of  a successful  mating;  i.e., 
7Tjj  = Pr(Fij  = 1).  Further  assume  that 


where  p.  is  unknown.  To  finish  the  model  specification  we  assume  that  the  Ui  s are 
iid  (independently  and  identically  distributed)  N(O,0„)  and  independent  of  the  VjS, 
which  are  assumed  to  be  iid  N(O,0„).  Let  i/>  = (fj,,9u,6v)T  denote  the  vector  of 
unknown  parameters. 
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The  marginal  likelihood  function  is  given  by 


exp  {yjjin  + Uj  + Vj)} 
1 + exp{/r  + Ui  + Vj} 


where  c is  a constant  that  does  not  depend  on  i/>.  A standard  frequentist  analysis 
requires  maximizing  the  marginal  likelihood  function  with  respect  to  i />,  i.e. , finding 
the  maximum  likelihood  estimator  of  t/ Due  to  the  crossed  nature  of  the  design, 


of  standard  maximization  techniques  such  as  Newton- Raphson. 

An  obvious  alternative  to  direct  maximization  in  this  case  is  an  EM  algorithm 
(Dempster,  Laird  and  Rubin  1977)  where  the  random  effects  are  treated  as  missing 
data.  However,  the  E-step  of  this  algorithm  requires  calculating  an  expectation  with 


is  a complex,  high  dimensional  density,  and  consequently  a closed  form  answer  at  the 
E-step  is  impossible.  We  will  return  to  this  example  in  Section  1.2. 

What  should  one  do  when  faced  with  a situation,  as  in  Example  1.1.1,  where  the 
required  probability  or  expectation  is  impossible  to  calculate  analytically?  Typically, 
one  would  be  willing  to  settle  for  a good  approximation.  We  could  use  an  analytical 
approximation,  such  as  the  Laplace  approximation  (Tierney,  Kass  and  Kadane  1989), 
or  a numerical  approximation.  If  we  decide  to  employ  a numerical  method  then  there 
are  several  options  available.  Specifically,  we  could  use  standard  numerical  integra- 
tion, e.g.,  quadrature  methods  (Abramowitz  and  Stegun  1964),  quasi-Monte  Carlo 
methods  (Fang,  Wang  and  Bentler  1994),  classical  Monte  Carlo  integration  (Robert 
and  Casella  1999),  or  Markov  chain  Monte  Carlo  methods.  Each  of  these  methods  has 
advantages  and  disadvantages  when  compared  to  the  others  making  the  appropriate 


(1.1)  involves  an  intractable  integral  of  dimension  IJ.  This  integral  precludes  the  use 


respect  to  f(u , u|y;  ■0^),  the  conditional  density  of  the  random  effects  given  the  data 
(with  the  parameters  set  equal  to  the  current  values).  Unfortunately,  f(u,v |y;0^) 
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choice  highly  problem  specific.  Generally,  deterministic  methods  will  converge  more 
quickly  than  sampling-based  approaches  when  the  dimension  of  the  problem  is  small. 
On  the  other  hand,  as  the  dimension  of  the  integrals  increases,  the  computational 
cost  of  deterministic  methods  becomes  prohibitive  (Traub  and  Wozniakowski  1994) 
making  sampling-based  methods  preferable.  If  a sampling-based  method  is  to  be 
used,  one  should  first  attempt  to  use  classical  Monte  Carlo  methods  that  are  based 
on  iid  samples  from  the  distribution  of  interest.  Finally,  if  the  distribution  of  interest 
is  too  high-dimensional  and  complex  to  allow  iid  sampling,  then  MCMC  algorithms 
may  provide  the  wherewithal  to  approximate  the  integrals. 

In  order  to  make  our  discussion  concrete,  we  require  some  notation.  Let  x = 
(x\,  X2, . . . , xp)T  and  suppose  n(x)  is  a density  function  defined  on  X C Rp  with 
p > 1.  The  basic  problem  may  be  formulated  as  follows.  Suppose  we  require  the 
expectation  of  some  function  h with  respect  to  7r(x);  that  is, 

Enh  := 

but  this  integral  cannot  be  evaluated  analytically.  (We  are  of  course  assuming  that 
EM  :=  /,  \h(x)\  7r(x)  dx  < oo.) 

Classical  Monte  Carlo  integration  requires  an  iid  sample  x^\  x@\  . . . , x^  from 
7r(x).  This  is  often  done  with  rejection  sampling  (Robert  and  Casella  1999,  Chapter 
2).  With  the  sample  in  hand,  we  can  construct  the  following  strongly  consistent 
estimate  of  Enh: 

hn:=-Th(x{i)).  (1.2) 

n z J 

i=i 

Here  we  are  simply  taking  advantage  of  the  strong  law  of  large  numbers  (SLLN).  Of 
course,  it  would  be  helpful  to  know  just  how  large  n needs  to  be  in  order  for  the 
approximation  to  be  good.  Under  iid  sampling  (and  the  assumption  that  Enh2  < oo) 
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it  is  easy  to  obtain  a measure  of  the  accuracy  of  hn.  Specifically,  note  that 

1 . 71  1 

var Chn)  = — - var (h(x^))  = — vai(h(x^)) 

n 2 z— ' n 

1=1 

which  is  usually  estimated  by 

yn  :=  — ^ ( h(x w)  - h„)  . 

2=1 

Now  since 

h<n  E-jfh 

\fvn 

is  approximately  N(0, 1)  when  n is  large,  we  can  construct  a valid  (asymptotic)  con- 
fidence interval  for  Enh.  We  can  then  say  that  hn  is  sufficiently  accurate  (i.e.,  n is 
large  enough)  when  the  width  of  the  confidence  interval  is  small.  Note  how  simple  it 
is  to  assess  the  accuracy  of  the  estimate  in  this  classical  Monte  Carlo  setting. 

Now  suppose  that  generating  iid  samples  from  7 r(sc)  is  either  impossible  or  ex- 
tremely inefficient.  If  applicable,  MCMC  techniques  will  allow  one  to  obtain  a sample 
x^l\  . . . , x that  is  only  approximately  from  7r(*),  is  neither  independent  nor 
identically  distributed  and  yet,  remarkably,  can  still  be  used  via  (1.2)  to  obtain  a 
strongly  consistent  estimate  of  E^h.  Instead  of  the  SLLN,  we  are  now  appealing  to 
the  ergodic  theorem  for  Markov  chains.  Unfortunately,  assessing  the  accuracy  of  the 
estimate  is  much  more  difficult  in  the  MCMC  context  (Geyer  1992).  This  is  one  of 
the  main  reasons  why  we  regard  MCMC  as  a last  resort  that  should  be  used  only 
when  classical  Monte  Carlo  methods  are  prohibitively  difficult. 

In  the  next  section  we  introduce  the  Metropolis-Hastings  algorithm  and  the  Gibbs 
sampler,  which  are  the  fundamental  MCMC  algorithms.  Hybrid  algorithms  are  those 
in  which  different  versions  of  these  fundamental  algorithms  are  used  simultaneously. 
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1.2  MCMC  Algorithms 


We  begin  with  the  Metropolis-Hastings  algorithm.  Recall  that  the  goal  is  to 
simulate  (approximately)  from  n(x).  To  begin  we  must  choose  a conditional  density, 
q(y\x)  say,  such  that  for  each  fixed  x,  it  is  straightforward  to  draw  random  variates 
from  q(-\x).  Here  is  the  basic  Metropolis-Hastings  algorithm. 

1.  Choose  an  arbitrary  starting  value  x^  and  set  i — 0. 

2.  Sample  an  observation  y from  q(y |*W). 

3.  Set  a;(l+1)  = y with  probability 


otherwise  reject  y and  set  x^l+l^  = x^\  Put  i = i + 1. 

4.  If  i < n,  return  to  Step  2,  otherwise  stop. 

There  are  two  especially  noteworthy  features  of  this  algorithm.  Firstly,  suppose 
that  we  know  ir(x)  only  up  to  a normalizing  constant;  that  is,  n(x)  = ctt*(x),  where 
we  have  tt*(x),  but  c is  an  unknown  constant.  Since  we  can  replace  n with  n*  in 
the  formula  for  a(x^\y)  without  changing  anything,  we  can  still  use  the  Metropolis- 
Hastings  algorithm  in  this  case.  Secondly,  the  flexibility  of  this  algorithm  is  evident 
since  it  works  for  almost  any  choice  of  q that  has  at  least  the  same  support  as 
7r.  (See  Chib  and  Greenberg  (1995)  for  an  extensive  yet  simple  introduction  to  the 
Metropolis-Hastings  algorithm.) 

In  the  next  two  examples  we  introduce  common  ways  to  choose  the  conditional 
density  q,  which  is  often  referred  to  as  the  candidate  density.  In  both  cases  we 
illustrate  the  required  calculations  with  “toy”  examples  that  do  not  require  MCMC 
in  order  to  simulate  from  the  target  distribution. 

Example  1.2.1  If  we  choose  q(y\x)  = q{y)  so  that  the  proposed  move  is  independent 
of  the  current  state,  then  we  have  the  independence  sampler.  As  a specific  application, 
suppose  that  the  target  distribution  is  an  Exponential  with  mean  1;  that  is,  the 
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density  is  given  by 

7r(x)  = e~xI(x  > 0) 

Then  an  independence  sampler  would  result  if  we  chose  as  the  candidate  an  Expo- 
nential with  mean  1//3,  or 

q(y)  = pe~pvI(y  > 0) 

If  the  current  state  is  x,  then  the  proposal,  y,  will  be  accepted  with  probability 
a(x,y)  = min  , 1^  = min  , l) 

and  otherwise  the  chain  will  remain  at  x. 

Example  1.2.2  The  random  walk Metropolis-Hastings  algorithm  results  when  q(y\x) 
q(y  — x)  = q(x  — y).  For  example,  suppose  that  the  target  distribution  is  N(0, 1) 
and  the  proposal  is  drawn  from  a N(x,  1)  distribution,  i.e., 

If  the  current  state  is  x,  then  the  proposal,  y,  will  be  accepted  with  probability 

a(x,  y)  = min  (exp  [0.5(x2  — y2)]  , l) . 


Example  1.1.1  continued.  Here  is  a somewhat  more  realistic  example  of  the  use 
MCMC.  Let  9u\  6v^)T  denote  the  value  of  the  parameter  after  the  rth 

iteration  of  EM.  The  E-step  at  the  (r  + l)th  iteration  of  EM  requires  calculating  a 
certain  expectation  with  respect  to  the  density  f(u,  v|y;  which  is  given  by 


C («ir))"1  Wr)) 


IIt 


L 


exp{y«j(/x(r)  + 

+ exp{/x(r)  +U{  + Vj} 


exp  < - 


where  c is  an  unknown  normalizing  constant.  There  is  obviously  no  hope  of  analyti- 
cally evaluating  an  expectation  with  respect  to  this  density. 
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The  Monte  Carlo  EM  algorithm  (Wei  and  Tanner  1990)  involves  replacing  this  in- 
tractable expectation  with  a Monte  Carlo  approximation  of  the  form  (1.2).  However, 
if  IJ  is  very  large  (say  > 40),  it  will  be  essentially  impossible  to  make  iid  draws  from 
f(u,v |y;t//r^)  (Booth  and  Hobert  1999).  Fortunately,  McCulloch  (1997)  has  devel- 
oped a Metropolis-Hastings  algorithm  for  this  situation.  McCulloch’s  algorithm  is  a 
bit  more  complicated  than  the  ones  we  have  described  so  far.  It  is  a so-called  “one-at- 
a-time”  Metropolis-Hastings  algorithm.  It  involves  cycling  through  the  components 
of  (uT,  vT)T  one  at  a time,  and  updating  each  one  using  an  independence  sampler 
with  candidate  either  N(0, 0^)  or  N(0, 0^),  depending  on  whether  the  component 
to  be  updated  is  a u or  a v.  (Note  that  / is  playing  the  role  of  7 r in  this  example.) 

The  Gibbs  sampler  (Gelfand  and  Smith  1990  and  Casella  and  George  1992)  is 
also  a basic  MCMC  algorithm.  This  algorithm  is  most  useful  when  we  can  identify 
full  conditional  densities  that  are  easy  to  sample  from  directly.  As  above,  consider  a 
random  vector  x = (xi, . . . ,xp)T . Let  = (xi, ... , £;_i , rr;+i , ...  ,xp)T.  Then  the 
full  conditional  densities  fi(xi\x-i),  for  i — 1, . . . ,p  are  fi(xi\x-i)  = n(x)/  f tt(x)  dxi. 
Hence,  as  a function  of  xi:  fi(xi\x-i)  oc  7r(a:).  This  proportionality  allows  us  to  use 
the  standard  techniques  described  by  Gilks  (1996)  to  identify  the  full  conditional 
densities.  Here  is  the  basic  Gibbs  sampling  algorithm. 

1.  Choose  an  arbitrary  starting  value  and  set  i = 0. 

2.  Generate  in  p steps  as  follows 


,6+i) 


,6+i) 


(1.3) 


and  set  i = i + 1. 


3.  If  z < n,  return  to  Step  2;  otherwise  stop. 
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We  are  assuming  here  that  each  Xi  is  univariate,  but  this  is  actually  not  necessary. 
If  at  least  one  of  the  Xi  is  multivariate,  then  the  phrase  block  Gibbs  sampler  is  used. 
When  p = 2,  the  algorithm  is  sometimes  referred  to  as  data  augmentation.  It  is  also 
not  necessary  to  update  the  components  in  a fixed  order.  Indeed,  some  flavors  of  the 
Gibbs  sampler  use  random  permutations  for  the  update  order  at  every  step.  Actually, 
the  variations  on  this  theme  are  limited  only  by  one’s  imagination. 

An  important  feature  of  both  the  Gibbs  sampler  and  the  Metropolis-Hastings 
algorithm  is  easy  implementation.  However,  it  is  not  always  clear  which  one  should 
be  used.  In  fact,  it  is  unfortunate  but  true  that  it  is  impossible  to  make  a general 
statement  about  which  algorithm  is  best.  The  preferred  algorithm  is  the  one  that 
most  quickly  produces  a sample  that  is  representative  of  ir.  If  the  candidate  q is 
chosen  wisely,  Metropolis-Hastings  will  usually  accomplish  this  faster  than  Gibbs. 
On  the  other  hand,  if  one  chooses  q unwisely,  there  will  be  a prohibitive  number  of 
rejections  and  it  will  be  a very  long  time  before  the  output  looks  like  a sample  from 
7 r.  Deciding  which  MCMC  algorithm  to  choose  is  further  complicated  by  the  huge 
number  of  hybrid  algorithms  that  can  be  created. 

One  hybrid  that  has  become  quite  standard  is  a Gibbs  sampler  where  one  or  more 
of  the  p “Gibbs  steps”  are  replaced  by  Metropolis-Hastings  steps.  This  hybrid  is 
useful  when,  e.g.,  at  least  one  of  the  full  conditional  densities  is  of  a non-standard 
form.  This  approach  is  illustrated  in  the  next  example  where  we  consider  a Bayesian 
version  of  the  Normal  means  model  with  an  improper  prior  specification.  The  prior 
that  we  use  is  the  standard  uniform  shrinkage  prior  and  is  a special  case  of  the  default 
priors  recently  developed  by  Natarajan  and  Kass  (2000). 

Example  1.2.3  Let  y = ( yi,y2 , . . . ,Vk)T  denote  a vector  of  observed  data  that  we 
assume  came  from  the  following  model.  Conditional  on  b = ( bi,b2 , . . . ,bk)T  and  /?, 
the  components  of  y are  independent  and  yj\b,/3  ~ N(/3  + bj,  <j))  where  4>  is  known. 
Conditional  on  6,  the  components  of  b are  iid  N(0,  9).  Finally,  /3  and  6 are  assumed 
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a priori  independent  and  we  put  a flat  prior  on  P\  that  is,  /(/?)  a 1,  and  our  prior 
for  9 is  given  by 


1TKXJS2  if  > 0 
(*+*) 

0 otherwise  . 

where  / denotes  a generic  density.  The  posterior  density  takes  the  form 


n(b,P,e\y) 


f(y\b,P)f(b\6)f(f3)f(6) 
JfJf(y\b,p)f(b\9)f(P)f(6)dbdpd9  ' 


Suppose  we  require  the  posterior  expectation  of  some  function  h(6, /?,#);  that  is,  we 
need 


E^h  — SSSh(b'  P,  6)  7 r(b,  P,  6\y)  db  dp  d9  . 


The  following  MCMC  algorithm  could  be  used  to  obtain  an  approximate  sample  from 
7r  that  could  be  used  to  approximate  Enh.  The  sample  will  be  denoted  by 


{(6®/?®, 0«>):i  = O n}  . 


Consider  using  the  block  Gibbs  sampler  in  which  all  of  the  components  of  b are 
updated  simultaneously,  but  univariate  updates  are  used  for  9 and  p.  (It  is  possible 
to  use  univariate  updates  for  the  6,’s,  but  it  is  generally  good  to  do  block  updates 
whenever  possible.)  In  order  to  run  this  MCMC  algorithm,  we  require  the  densities 
associated  with  b\P,9,y,  P\b,9,y,  and  9\b,P,y.  Let  y = \ 'YhVi  and  b = £ 
Straightforward  calculations  show  that 

and  b\P,  9 , y turns  out  to  have  a fc-dimensional  Normal  distribution.  Specifically, 
letting  I*,  denote  a /c-dimensional  identity  matrix,  we  have 


b\/3,e,y~Nt(„,V) 
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where 


and 


(Vi-P)0  (2/2  -p)9  {yk~P)0\T 

0 + 6 ’ 0 + 6 0 + 6 ) 


06 

V = ^-ik . 


0 + 6 

Finally,  the  density  associated  with  0\b,P,y  is  of  the  following  non-standard  form 
f{0\b,p,y)  = c{0  + 6)~2^  ex 


where  c is  an  unknown  normalizing  constant. 

Instead  of  trying  to  figure  out  how  to  make  draws  from  this  non-standard  den- 
sity, we  will  replace  the  Gibbs  step  for  0 with  a Metropolis-Hastings  step.  Hence, 
we  are  moving  from  the  Gibbs  sampler  to  a hybrid  MCMC  algorithm.  Making  no 
claims  about  optimality,  we  propose  updating  0 using  an  independence  sampler  with 
candidate  density  given  by 


<?(«) 


(iE‘=1<r‘r 


exp 


•iv  b) 
2 0^  J 

j= 1 


(1.4) 


Making  draws  from  q is  straightforward  as  it  is  an  Inverted  Gamma  density.  If  we 
let  0b)  denote  the  current  value  of  the  chain  and  0 the  proposed  move,  then  the 
acceptance  probability  for  this  step  is 

'0^  + 6k  2 


+ 4> 


,1 


Now  putting  all  of  this  together,  we  have  the  following  hybrid  sampler. 

1.  Choose  an  arbitrary  starting  value  (6^0,^00^000)  and  set  i = 0. 

2.  Set  b*  = | Y^kj= 1 b^  and  generate  /?b+i)  ~ N (y  — b *,  |). 

3.  Generate  6(*+b  ~ Nk(fi*,V*)  where 

* f(yi-  /3{i+1))0(i)  (2/2  - /3(i+1))0b)  (yk  - /3(<+1))0b)\ T 

M v 0(O  + ’ 0®  + <t>  ’ ' ’ 0{<)  + 0 y 
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and 


4.  Generate  9 from  the  density  q in  (1.4)  (with  the  b^+1'l,s  in  place  of  the  6/s). 
Generate  U ~ Uniform(0, 1).  If  U is  less  than 


then  set  0(*+1)  = 6,  otherwise  set  0h+i)  = Put  i = i + 1.  (It  is  possible  to 
use  more  than  one  Metropolis  iteration  in  this  step,  however,  it  is  unnecessary.) 
5.  If  i < n,  return  to  Step  2;  otherwise  stop. 

Note  that  there  is  nothing  special  about  updating  /3,  then  b,  then  8.  Any  per- 


The output  from  this  algorithm  can  be  used  to  construct  the  following  estimate 
of  E^h 


Before  moving  on  to  the  final  section  concerning  implementation  issues,  we  briefly 
describe  perfect  (or  exact)  sampling.  The  reason  for  using  MCMC  is  that  getting  an 
iid  sample  from  i r via  the  usual  means  (e.g.,  rejection  sampling)  is  either  impossible 
or  extremely  inefficient.  Perfect  sampling  is  a way  of  using  an  MCMC  algorithm 
to  produce  iid  samples  from  n.  The  original  perfect  sampling  algorithms  (see,  e.g., 
Propp  and  Wilson,  1996  and  Fill,  1998)  were  designed  for  situations  where  X contains 
a large  but  finite  number  of  points.  In  general,  these  algorithms  require  the  user  to  run 
as  many  Markov  chains  as  there  are  points  in  X.  While  there  are  some  extensions  to 
continuous  and  unbounded  state  spaces,  these  require  sophisticated,  problem-specific 
techniques  to  reduce  the  number  of  chains  under  consideration  to  a finite  number 


mutation  of  this  order  would  work.  Also,  note  that  a starting  value  for  (3^  is  not 
required. 
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(Murdoch  and  Green  1998).  Consequently,  perfect  sampling  has  remained  restricted 
to  a small  number  of  statistical  applications. 

1.3  General  Implementation  Issues 

Once  we  have  chosen  which  MCMC  algorithm  will  be  employed,  there  are  still 
several  operational  issues  of  concern.  Some  standard  questions  are  (1)  How  should  we 
choose  the  starting  values?  (2)  How  many  chains  should  we  use?  (3)  When  should  the 
sampling  begin?  (4)  When  should  the  sampling  stop?  (5)  Is  there  software  available? 

(1)  The  choice  of  starting  values  is  irrelevant  if  we  run  the  chain  long  enough.  That 
is,  it  is  valid  to  use  ergodic  averages  to  estimate  the  desired  expectation  no  matter 
what  starting  point  is  chosen.  However,  trying  a few  preliminary  runs  with  dispersed 
starting  values  can  help  protect  against  starting  in  a local  mode.  When  this  occurs 
the  chain  can  get  “stuck”  for  long  periods  and  thus  require  many  iterations  to  explore 
the  support  of  7 r. 

(2)  The  number  of  chains  to  be  employed  remains  a somewhat  unsettled  issue.  Gel- 
man  and  Rubin  (1992),  Geyer  (1992),  and  Raftery  and  Lewis  (1992)  provide  a thor- 
ough discussion.  Those  that  advocate  multiple  independent  chains  with  dispersed 
starting  points  argue  that  this  is  the  only  way  to  ensure  that  the  support  of  ir  has 
been  thoroughly  explored.  Of  course,  this  can  require  substantial  computational  re- 
sources. On  the  other  hand,  if  the  algorithm  is  fairly  well  behaved  then  a single  chain 
should  explore  the  support  of  the  target  density  rapidly. 

(3)  This  is  the  issue  of  burn-in,  that  is,  the  idea  that  we  should  discard  the  first  B , say, 
iterations  of  the  algorithm  so  that  the  Markov  chain  will  “forget”  the  starting  value. 
Actually,  hn  is  still  a consistent  estimator  of  Enh  if  there  is  no  burn-in;  i.e.,  B = 0. 
However,  if  burn-in  is  used  then  the  preferred  method  is  to  determine  B before  any 
simulation  starts  (Rosenthal,  1995a,  Roberts  and  Tweedie,  1999).  Unfortunately,  this 
can  be  a difficult  task  for  a realistic  model.  The  alternative  to  a priori  calculation  of 
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B is  the  so-called  convergence  diagnostics  which  try  to  determine  when  to  stop  burn- 
in  and  start  sampling  based  on  the  output  of  the  algorithm.  Brooks  and  Roberts 
(1998),  Cowles  and  Carlin  (1996),  and  Robert  and  Casella  (1999)  contain  thorough 
discussions  about  the  use  of  convergence  diagnostics. 

(4)  A reasonable  way  to  decide  when  to  stop  sampling  is  to  wait  until  the  ergodic 
average  hn  reaches  some  prespecified  level  of  accuracy.  Just  as  with  iid  sampling,  if  a 
central  limit  theorem  holds  for  hn  then  we  can  construct  sensible  confidence  intervals 
for  hn.  That  is,  we  can  say  that  the  estimate  of  Enh  is  sufficiently  accurate  if  the 
width  of  the  confidence  interval  is  small. 

In  order  to  calculate  this  confidence  interval  it  is  crucial  to  obtain  a good  estimate 
of  the  variance,  o\  say,  of  the  asymptotic  Normal  distribution  of  hn.  Of  course,  this  is 
complicated  by  the  fact  that  MCMC  algorithms  produce  dependent  sequences.  The 
simplest  way  of  estimating  is  by  the  method  of  batch  means  (Geyer  1992).  An 
alternative  is  the  regenerative  simulation  method  proposed  by  Mykland,  Tierney  and 
Yu  (1995)  and  Robert  (1995). 

(5)  BUGS  (Bayesian  Inference  Using  Gibbs  Sampling)  (Spiegelhalter,  Thomas  and 
Best  1999)  is,  to  our  knowledge,  the  only  software  package  that  allows  one  to  analyze 
a wide  range  of  models  via  MCMC.  As  its  name  suggests,  this  package  focuses  on 
Gibbs  sampling  but  it  will  include  a Metropolis-within-Gibbs  step  if  a non-standard 
full  conditional  density  is  encountered.  BUGS  may  be  freely  obtained  via  the  World 
Wide  Web  at  http://www.mrc-bsu.cam.uk/bugs/welcome.shtml. 

Free  software  is  also  available  that  allows  implementation  of  convergence  diag- 
nostics and  aids  in  the  analysis  of  MCMC  output.  CODA  (Convergence  Diagnosis 
and  Output  Analysis)  was  developed  specifically  for  handling  BUGS  output  and  may 
be  obtained  from  the  same  web  site  as  BUGS.  Another  option  is  BOA  (Bayesian 
Output  Analysis)  which  is  more  flexible  than  CODA  and  may  be  obtained  from 
http : / /www . public-health . uiowa . edu/BOA/. 


CHAPTER  2 

HONEST  EXPLORATION  OF  INTRACTABLE  PROBABILITY 
DISTRIBUTIONS  VIA  MARKOV  CHAIN  MONTE  CARLO 

2.1  Introduction 

2.1.1  The  Questions 

During  the  decade  or  so  since  the  appearance  of  the  seminal  paper  by  Gelfand 
and  Smith  (1990),  Markov  chain  Monte  Carlo  (MCMC)  methods  have  revolution- 
ized statistical  computing.  While  the  Bayesians  have  certainly  made  the  most  use  of 
MCMC,  applications  have  popped  up  in  many  different  areas  of  statistics.  For  exam- 
ple, MCMC  techniques  can  be  used  to  calculate  P-values  in  exact  conditional  infer- 
ence (Diaconis  and  Sturmfels  1998)  and  to  maximize  intractable  likelihood  functions 
associated  with  generalized  linear  mixed  models  (McCulloch  1997).  Furthermore,  the 
popularity  of  the  BUGS  (Bayesian  inference  Using  Gibbs  Sampling)  software  package 
(Spiegelhalter  et  al.  1999)  indicates  that  MCMC  is  used  routinely  in  applied  work. 
An  excellent  introduction  to  MCMC  is  the  book  edited  by  Gilks  et  al.  (1996). 

MCMC  methods  allow  us  to  circumvent  the  difficulties  associated  with  drawing 
random  samples  from  complex,  high-dimensional  probability  distributions.  Unfor- 
tunately, using  a Markov  chain  instead  of  a random  sample  creates  new  problems 
that  must  be  solved  before  we  can  use  MCMC  methodology  with  the  same  level  of 
confidence  that  we  have  in  classical  Monte  Carlo  methods.  Our  goal  in  this  chapter 
is  to  spell  out  exactly  what  these  new  problems  are  and  to  explain  how  they  can  be 
solved. 

We  begin  with  classical  Monte  Carlo  integration.  Suppose  we  want  to  know  the 
value  of  Eng  :=  f g(x)  n(x)  dx , where  n is  a probability  density  and  g is  some 
function,  but  this  integral  cannot  be  evaluated  analytically.  (Assume  that  Eng2  is 
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finite.)  Classical  Monte  Carlo  integration  requires  an  independent  and  identically 
distributed  (iid)  sample  Xi,  X2,  ■ ■ ■ from  n.  By  the  Strong  Law  of  Large  Numbers, 
with  probability  1, 


9n 


~-f 'g(X,)^E„g 
n ' 

i=i 


as  n — » oo. 


(2.1) 


Moreover,  since  the  second  moment  is  finite,  there  is  a central  limit  theorem  (CLT) 
for  gn\  that  is,  y/n(gn  — Eng)  -4  N(0,cr2)  and,  of  course,  the  usual  sample  variance 
of  the  g(XiY s is  an  unbiased  estimate  of  a2.  Hence,  it  is  simple  to  construct  a Monte 
Carlo  standard  error  for  gn  and  this  can  be  used  to  decide  upon  a reasonable  value 
of  n. 

The  fundamental  idea  underlying  MCMC  is  that  even  when  obtaining  iid  draws 
from  7T  is  prohibitively  difficult,  it  may  still  be  possible  to  simulate  a Markov  chain, 
Xq,  X\,  X^, . . . that  has  stationary  density  n.  Let  /„  denote  the  density  of  X*n. 
Under  a few  simple  regularity  conditions  (described  in  Section  2.2)  X*  converges  in 
total  variation  to  a random  variable  from  7 r;  that  is, 


- J I fn(x)  ~ 7r(s)  | dx  | 0 as  n — > oo  . 

(This  is  much  stronger  than  convergence  in  distribution.)  More  importantly  for  us, 
these  same  regularity  conditions  imply  that  the  Ergodic  Theorem  holds  and  hence 
with  probability  1, 

1 n_1 

9n  :=  - 9^X*^  E*g  as  n ->  oo  . (2.2) 

Tt 

i=0 

Unfortunately,  Eng2  < oo  is  no  longer  enough  to  guarantee  a CLT  (Meyn  and  Tweedie 
1993,  Chapter  17).  Indeed,  sufficient  conditions  for  a CLT  involve  the  convergence 
rate  (or  mixing  properties)  of  the  Markov  chain  and,  as  we  will  see,  these  conditions 
can  be  quite  difficult  to  verify  in  practice. 
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Whenever  the  MCMC  method  is  employed,  the  user  should  give  serious  thought 
to  the  following  two  questions:  (Ql)  When  should  sampling  begin?  That  is,  how  long 
does  it  take  the  Markov  chain  to  get  sufficiently  close  to  the  stationary  distribution; 
i.e.,  what  is  an  appropriate  burn-in ? (Q2)  How  long  should  the  sampling  continue 
after  burn-in?  That  is,  how  do  we  know  when  the  estimates  based  on  the  output 
are  sufficiently  accurate  or,  put  another  way,  what  are  the  standard  errors  of  the 
estimates? 

Observe  that  having  to  deal  with  (Ql)  and  (Q2)  is  a “new”  problem  in  the  sense 
that,  when  it  is  possible  to  make  iid  draws  from  n,  (Ql)  is  moot  and  (Q2)  is  easy.  In 
most  practical  applications  of  MCMC,  (Ql)  and  (Q2)  are  not  rigorously  addressed. 
Instead,  a mixture  of  intuition,  experience,  and  ad  hoc  methods  is  used  to  determine 
the  amount  of  burn-in  and  the  accuracy  of  the  resulting  estimates.  One  has  to 
wonder  how  this  affects  the  quality  of  any  subsequent  inferences.  In  this  chapter,  we 
will  explain  how  to  develop  rigorous  answers  to  (Ql)  and  (Q2). 

The  astute  reader  may  be  thinking:  If  (2.2)  holds  for  any  starting  point  and  if 
it  is  possible  to  calculate  a valid  standard  error  for  <7*  using  a non-stationary  chain, 
then  why  bother  with  burn-in?  Actually,  there  are  many  who  do  think  that  there  is 
no  need  for  burn-in  in  such  a situation.  For  example,  see  Chapter  6 of  Ripley  (1987) 
or  Chapter  3 of  Bratley,  Fox  and  Schrage  (1987).  (Also,  see  C.  J.  Geyer’s  web  page 
at  http://www.stat.umn.edu/~charlie/  for  more  on  this  point  of  view.) 

2.1.2  Honest  Answers 

In  this  chapter  we  consider  (what  is  currently)  the  most  straightforward  method  of 
developing  rigorous  answers  to  (Ql)  and  (Q2).  This  method  is  applicable  only  when 
the  underlying  Markov  chain  converges  to  its  stationary  distribution  at  a geometric 
rate;  i.e.,  when  the  chain  is  geometrically  ergodic  (Meyn  and  Tweedie  1993,  Chapter 
15).  Generally  speaking,  geometrically  ergodic  chains  are  “good”  in  the  sense  that 
they  can  be  expected  to  quickly  produce  output  that  is  similar  to  what  one  would  get 
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by  sampling  directly  from  the  target  distribution.  The  following  example,  introduced 
by  Roberts  and  Rosenthal  (1998a),  clearly  illustrates  the  potential  difference  between 
geometric  and  subgeometric  (slower  than  geometric)  convergence. 

Example  2.1.1  Suppose  the  target  distribution  is  Exp(l);  that  is,  tt(x)  — e~xI(x  > 
0).  Consider  an  independence  Metropolis  sampler  with  an  Exp(0)  proposal;  i.e.,  the 
proposal  density  is  p(x)  = 9e~6xI(x  > 0).  The  chain  evolves  as  follows:  Let  the 
current  state  be  Xn  = x.  Draw  y ~ Exp(0)  and  set  Xn+\  = y with  probability 

min{^|ijp(s)’1} =exp{(z-  »)(1  -fl)}  a 1 . 

otherwise,  set  Xn+i  = x.  A more  algorithmic  way  to  think  of  this  is  as  follows:  Draw 
y ~ Exp($)  and  independently  draw  u ~ Uniform(0, 1).  If  u < exp{(x  - y)(  1 - 0)} 
then  set  Xn+i  — y,  otherwise  set  Xn+\  = x.  (See  Chib  and  Greenberg  (1995)  for 
a nice  introduction  to  the  Metropolis-Hastings  algorithm  and  Billera  and  Diaconis 
(2001)  for  an  interesting  geometric  interpretation  of  the  algorithm.) 

Note  that  if  9 = 1,  this  algorithm  provides  iid  draws  from  the  target  distri- 
bution. Results  in  Mengersen  and  Tweedie  (1996)  can  be  used  to  show  that  the 
chain  is  geometrically  ergodic  if  0 < 9 < 1 and  subgeometric  if  6 > 1.  (See  Sub- 
section 2.3.2  for  more  details.)  The  problem  in  the  6 > 1 case  is  that  the  tails 
of  the  proposal  density  are  too  light  relative  to  the  target  density.  This  makes  it 
difficult  for  the  chain  to  reach  larger  values  in  the  state  space  and,  when  it  does, 
it  tends  to  get  stuck  there  for  long  periods.  See  J.  S.  Rosenthal’s  web  page  at 
http://markov.utstat.toronto.edu/jeff/java/exp.html  for  a graphical  illus- 
tration of  this  Markov  chain  for  several  different  values  of  6. 

Let  X0,  Xi,  X2, . . . denote  this  independence  Metropolis  sampler  with  9 = 0.5 
and  starting  value  X0  = 1.  We  ran  10,000  independent  copies  of  this  chain  for  15 
iterations  in  order  to  see  how  close  the  distribution  of  Xi5  is  to  Exp(l).  (It  is  actually 
easy  to  get  an  exact  upper  bound  on  the  total  variation  distance  between  Xi5  and 
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Exp(l);  see  Subsection  2.3.2.)  A histogram  of  the  10,000  iid  copies  of  is  given  in 
Figure  2.1  along  with  (an  appropriately  scaled  version  of)  the  Exp(l)  density.  This 
plot  suggests  that  the  distribution  of  X15  is  very  close  to  Exp(l)  and  hence  that 
this  Markov  chain  converges  quickly.  We  performed  the  same  experiment  for  the 
subgeometric  chain  corresponding  to  9 = 4,  and  the  results  are  given  in  Figure  2.2. 
Judging  from  the  plot,  this  Markov  chain  is  still  quite  far  from  stationarity  after  15 
iterations.  The  spike  in  the  histogram  at  the  value  1 is  due  to  the  fact  that  in  many 
of  the  10,000  runs,  the  chain  was  stuck  at  the  starting  value  for  all  15  iterations.  In 
the  third  and  last  performance  of  the  experiment,  we  ran  the  subgeometric  chain  for 
1,000  iterations  instead  of  just  15.  The  results  are  given  in  Figure  2.3.  It  appears 
that  the  distribution  of  X10oo  is  still  quite  far  from  Exp(l).  Note,  in  particular,  the 
discrepancies  near  0 and  3.  Only  2 of  the  10,000  Xiooo’s  were  larger  than  4.  In  a 
random  sample  of  size  10,000  from  the  Exp(l)  distribution,  we  expect  to  see  about 
180  observations  larger  than  4. 

Of  course,  not  all  geometrically  ergodic  chains  converge  as  quickly  as  the  9 = 0.5 
chain,  and  not  all  subgeometric  chains  behave  as  badly  as  the  9 = 4 chain.  Indeed, 
when  9 is  just  a little  larger  than  1,  this  independence  Metropolis  sampler  works 
pretty  well.  However,  in  practical  applications  of  MCMC,  the  user  does  not  have  the 
luxury  of  comparing  the  distribution  of  Xn  to  the  stationary  distribution  for  various 
values  of  n.  Establishing  geometric  ergodicity  provides  the  user  with  “peace  of  mind” 
concerning  the  mixing  rate  of  the  Markov  chain,  and,  as  we  will  see,  allows  the  user 
to  formally  answer  (Ql)  and  (Q2). 

The  method  described  herein  for  developing  exact  answers  to  (Ql)  and  (Q2) 
requires  that  one  establish  a drift  condition  and  an  associated  minorization  condition 
for  the  underlying  Markov  chain,  which  together  imply  geometric  ergodicity  (Meyn 
and  Tweedie  1993,  Chapters  15  & 16).  We  now  describe  how  drift  and  minorization 
can  be  used  to  give  rigorous  answers  to  (Ql)  and  (Q2). 
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Histogram  of  10,000  iid  copies  of  when  9 = 0.5  and 

*o  = l 


Figure  2.1:  The  independence  Metropolis  algorithm  with  6 = 0.5  was  run  for  15 
iterations  10,000  times.  In  each  case,  the  starting  value  was  X0  — 1.  This  is  a 
histogram  of  the  10,000  iid  copies  of  Xi5.  The  solid  line  is  (an  appropriately  scaled 
version)  of  the  stationary  density. 
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Histogram  of  10,000  iid  copies  of  Xi5  when  9 — 4 and  Xq  = 1 


0 1 2 3 4 5 6 

Figure  2.2:  The  independence  Metropolis  algorithm  with  9 — 4 was  run  for  15  itera- 
tions 10,000  times.  In  each  case,  the  starting  value  was  X0  = 1.  This  is  a histogram 
of  the  10,000  iid  copies  of  Xi5.  The  solid  line  is  (an  appropriately  scaled  version)  of 
the  stationary  density. 
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Histogram  of  10,000  iid  copies  of  Xiooo  when  9 = 4 and 

X0  = l 


Figure  2.3:  The  independence  Metropolis  algorithm  with  6 — 4 was  run  for  1,000 
iterations  10,000  times.  In  each  case,  the  starting  value  was  X0  = 1.  This  is  a 
histogram  of  the  10,000  iid  copies  of  X10oo-  The  solid  line  is  (an  appropriately  scaled 
version)  of  the  stationary  density. 
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Once  drift  and  minorization  have  been  established,  the  results  of  Rosenthal  (1995a) 
or  Roberts  and  Tweedie  (1999)  can  be  employed  to  calculate  exactly  how  many  it- 
erations are  necessary  to  get  within  a prespecified  (total  variation)  distance  of  the 
target  distribution.  In  other  words,  we  can  find  an  n!  such  that 

^ J | /„/(*)  - 7r(*)  | dx  < 0.01,  say. 

The  value  n'  is  an  honest  answer  to  (Ql). 

Typically,  the  characteristics  of  the  target  distribution  that  we  desire  are  esti- 
mated using  ergodic  averages  like  (2.2).  As  mentioned  above,  a finite  second  moment 
guarantees  a CLT  in  the  iid  case,  but  is  not  sufficient  when  using  a Markov  chain. 
Chan  and  Geyer  (1994)  have  shown,  however,  that  geometric  ergodicity  of  the  Markov 
chain  together  with  (a  bit  more  than)  a finite  second  moment  guarantees  the  following 
CLT 

v'5(£--E,9)4  N(0,<r,2).  (2.3) 

It  is  important  to  recognize  that  gn  and  are  quite  different  estimates  of  Eng  and 
hence  in  general  yf  a2.  Moreover,  the  usual  sample  variance  of  the  g(X*)' s will  not 
generally  be  a consistent  estimate  of  a*.  Fortunately,  a consistent  estimate  of  <r*  can 
be  formed  using  the  regenerative  simulation  technique,  which  requires  a minoriza- 
tion condition  (Mykland  et  al.  1995,  Robert  1995).  Hence,  we  are  able  to  calculate 
asymptotic  standard  errors  for  the  estimates  based  on  the  first  n,  say,  iterations,  and 
then  continue  to  run  the  chain  if  these  standard  errors  are  unacceptably  large.  Thus, 
we  are  able  to  provide  an  honest  answer  to  (Q2). 

2.1.3  An  Example 

In  Section  2.6  we  illustrate  a realistic  application  of  the  methods  described  in  this 
chapter  by  conducting  a detailed  study  of  a Gibbs  sampler  for  a Bayesian  hierarchical 
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model.  We  now  describe  this  model  and  give  the  reader  a taste  of  what  can  be 
accomplished  using  these  methods. 

The  model,  which  we  refer  to  as  (A4),  is  a Bayesian  version  of  the  standard, 
Normal  theory  one-way  random  effects  model  with  conjugate  priors.  It  has  three 
levels.  First,  conditional  on  6 = (9\, . . . , 9K)T  and  Ae,  the  data  values,  Yij,  are 
independent  with 

Yij\6,Xe  ~ N(0i,  A71) 

where  i = 1, . . . , K and  j = 1, . . . , m.  At  the  second  stage,  conditional  on  y and  A g, 
6 and  Xe  are  independent  with 

9 \ii,  A g ~ N(//l,  A^I)  and  Ae  ~ Gamma(a2,  b2), 

where  1 is  a K x 1 column  vector  of  ones,  I is  a K x K identity  matrix,  and  a2 
and  b2  are  known  positive  constants.  (We  say  W ~ Gamma(a,  /3)  if  its  density  is 
proportional  to  wa~le~w^I{w  > 0).)  Finally,  at  the  third  stage,  n and  A g are  assumed 
independent  with 

/i  ~ N(/x0,  Aq  1)  and  Xg  ~ Gamma(ai,  bi) 

where  and  //0,  A0,  a\  and  b\  are  known  constants;  all  but  /i0  are  assumed  to  be  strictly 
positive  so  that  all  of  the  distributions  are  proper.  Also,  we  assume  that  K > 3 and 
that  m > 2.  Note  that  the  standard,  Normal  theory  one-way  random  effects  model 
(Searle,  Casella  and  McCulloch  1992,  Chapter  3)  corresponds  to  viewing  /i,  Ae  and 
Xg  as  fixed  and  unknown.  Let  & — m~l  YaL  1 Vij- 

The  posterior  density  corresponding  to  model  (A4)  is  characterized  by 

tt(0,  /x,  Ae,  Xg\y)  a f(y\0,  Xe)  f{0\n,  Xg)  /( Ae)  f(y)  f(Xg)  (2.4) 

where  y is  a vector  containing  all  of  the  data,  and  / denotes  a generic  density.  The 
integrals  required  for  inferences  through  this  posterior  are  not  available  in  closed  form. 
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Thus,  we  might  resort  to  MCMC  techniques  like  the  Gibbs  sampler.  Indeed,  variance 
component  models  have  been  advocated  as  an  ideal  application  for  the  Gibbs  sampler 
(Gelfand  and  Smith  1990,  Gelfand,  Hills,  Racine-Poon  and  Smith  1990). 

The  data  in  Table  2.1  were  simulated  according  to  model  (M)  with  K — 6,  m — 8, 
ax  = a2  = b\  = &2  = A0  = 1 and  /x0  = 0.  We  now  pretend  that  the  origin  of  the  data 
is  unknown  and  that  we  desire  the  posterior  expectations  of  A g and  Ae  under  three 
different  priors;  that  is,  under  three  different  hyperparameter  settings.  The  three 
settings  are  given  in  Table  2.2.  Note  that  setting  #1  agrees  exactly  with  the  values 
used  to  simulate  the  data;  i.e.,  it  is  the  “correct”  prior.  Setting  #2  is  a “diffuse” 
prior. 

A block  Gibbs  sampler  (described  in  Section  2.6)  will  be  used  to  estimate  the  six 
posterior  expectations.  For  starting  values,  we  use  Qi  = & and  [i  — y.  Due  to  the 
structure  of  our  block  Gibbs  sampler,  starting  values  for  \g  and  Ae  are  not  required. 

Table  2.1:  Simulated  Data 


Cell 

1 2 3 4 5 

6 

Vi 

-0.22795  -1.1913  0.030547  0.48428  0.036639 

-0.026581 

Mt  = mK  = 48 

y = A/f 1 £'=i  £'=1  Vij  = -0.14906 
SSE  = ZL  E U(V‘i  - w<)2  = 23-251 

Table  2.2:  Three  Different  Prior  Specifications 


Hyperparameter 

Setting 

Oi 

h 

a2 

h 

•^0 

1 

1 

l 

1 

1 

0 

1 

2 

0.1 

0.1 

0.1 

0.1 

0 

0.1 

3 

3 

7 

6 

3 

0 

1 

We  first  address  (Ql)  by  finding  an  n'  such  that  | f | /„/  — tt  \ < 0.01,  where 
fn  denotes  the  marginal  density  of  the  nth  iterate  of  the  block  Gibbs  sampler  and 
7 r denotes  the  posterior  density  in  (2.4).  The  results  are  given  in  Table  2.3.  For 
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example,  under  the  first  prior,  after  4300  iterations  of  the  block  Gibbs  sampler,  the 
total  variation  distance  to  stationarity  is  at  most  0.0092.  (The  formula  used  to  find 
n'  is  given  in  Section  2.4.)  These  burn-ins  are  quite  manageable  considering  that  it 
takes  only  about  2 minutes  to  run  1 million  iterations  of  our  block  Gibbs  sampler  on 
a fast  PC.  Unfortunately,  as  we  show  in  Section  2.6,  things  do  not  always  work  out 
this  nicely.  Now  on  to  (Q2). 

Table  2.3:  Total  Variation  Bounds  for  the  Simulated  Data 


Hyperparameter 

Setting 

Iterations  (n') 

Bound 

1 

4300 

0.0092 

2 

150000 

0.0093 

3 

900 

0.0086 

Table  2.4  contains  point  estimates,  the  of’s,  and  asymptotic  95%  confidence  in- 
tervals for  the  posterior  expectations  of  A g and  Ae.  These  were  obtained  via  the 
regenerative  method,  which  requires  no  burn-in.  Standard  textbooks  on  simulation 
(e.g.,  Bratley  et  al.  1987)  consider  regenerative  simulation  to  be  the  preferred  method 
for  obtaining  confidence  intervals  from  simulation  output.  [See  Section  2.5  for  a de- 
tailed explanation  of  the  regenerative  simulation  method.) 

Table  2.4:  Point  Estimates  and  Asymptotic  95%  Confidence  Intervals  for  E(\g\y) 
and  E(\e\y) 


Setting 

Parameter 

Estimate 

95%  Cl 

1 

A e 

2.065 

0.378 

(2.052,  2.078) 

Ae 

1.754 

0.038 

(1.749,  1.759) 

2 

Ae 

4.229 

4.543 

(4.210,  4.248) 

Ae 

1.790 

0.027 

(1.789,  1.791) 

3 

Afl 

0.711 

0.026 

(0.708,  0.714) 

Ae 

1.856 

0.040 

(1.853,  1.859) 

Reported  in  Table  2.5  are  the  number  of  regenerations  upon  which  the  variance 
estimates  in  Table  2.4  are  based,  and  the  mean  number  of  iterations  per  regeneration. 
Roughly  speaking,  the  more  often  a Markov  chain  regenerates,  the  faster  it  converges. 
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Hence,  it  appears  that  the  chain  corresponding  to  setting  #3  mixes  the  fastest  and 
the  chain  associated  with  setting  #2  mixes  the  slowest.  Note  that  this  is  exactly 
what  we  would  have  guessed  based  on  Table  2.3. 

Table  2.5:  Regenerative  Behavior  of  Block  Gibbs 


Number  of 

Mean  number 

Setting 

regenerations 

of  iter /regen 

1 

9000 

4.7 

2 

50000 

7.4 

3 

14000 

3.8 

The  remainder  of  this  chapter  is  organized  as  follows.  Section  2.2  contains  some 
basic  Markov  chain  background  which  is  illustrated  using  a toy  Gibbs  sampler.  In 
Section  2.3,  drift  and  minorization  are  defined  and  we  demonstrate  the  type  of  calcu- 
lations required  to  establish  these  conditions  using  our  toy  Gibbs  sampler.  Section  2.3 
also  contains  an  heuristic  explanation  of  the  theoretical  connection  between  geomet- 
ric ergodicity  and  drift  and  minorization.  During  this  development,  we  derive  the 
coupling  inequality,  which  is  the  key  result  for  deriving  convergence  rate  bounds  for 
Markov  chains  on  general  state  spaces.  A theorem  of  Rosenthal  (1995a)  that  al- 
lows one  to  use  drift  and  minorization  to  get  exact  upper  bounds  on  the  distance  to 
stationarity  is  stated  in  Section  2.4.  Rosenthal’s  result  is  applied  to  our  toy  Gibbs 
sampler  for  illustration.  In  Section  2.5,  we  explain  how  to  use  regenerative  simulation 
to  calculate  Monte  Carlo  standard  errors.  Section  2.6  contains  another  analysis  like 
the  one  in  Subsection  2.1.3.  However,  in  this  case,  real  data  are  used  and  all  the  de- 
tails are  given.  The  data  used  in  Section  2.6  concern  styrene  exposure  of  laminators 
at  a boat  manufacturing  plant  (Lyles,  Kupper  and  Rappaport  1997).  Many  of  the 
more  technical  details  have  been  relegated  to  Section  2.7. 

2.2  Markov  Chain  Background 

This  section  consists  of  two  subsections.  In  Subsection  2.2.1,  we  develop  some 
necessary  notation,  briefly  describe  the  basic  convergence  results  for  ergodic  Markov 
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chains,  and  introduce  a toy  example  that  will  be  used  for  illustration  throughout  the 
chapter.  Geometric  ergodicity  and  its  consequences  are  described  in  Subsection  2.2.2. 
More  general  accounts  of  the  material  in  this  section  can  be  found  in  Nummelin 
(1984),  Meyn  and  Tweedie  (1993),  Tierney  (1994),  or  Robert  and  Casella  (1999). 
(Also  see  Section  2.7  for  more  detail.) 

2.2.1  Basics 

Let  X C Rp  for  p > 1 and  let  B denote  the  associated  Borel  cr-algebra.  Suppose 
that 

$ = {X0,XltX2,...} 

is  a discrete  time  Markov  chain  with  state  space  X and  Markov  transition  kernel  P; 
that  is,  for  x 6 X and  A € B, 


P(x,A)  = Pv(Xi+1eA\Xi  = x). 

In  order  to  simplify  the  exposition,  we  will  often  assume  that  the  probability  measure 
P(x,-)  has  a (conditional)  density,  A;(-|®),  with  respect  to  Lebesgue  measure.  That 
is, 

P(x,A)=  / k(u\x)  du  . 

Ja 

We  call  k the  Markov  transition  density.  (All  the  Gibbs  samplers  that  we  discuss 
have  Markov  transition  densities.)  For  n E N :=  {1,2,3,...},  let  Pn  denote  the 
n-step  transition  kernel;  i.e.,  Pn(x,A)  = Pr(Xi+n  G A\X,  = x)  so,  in  particular, 
P = P1.  Note  that  Pn(x,  •)  is  the  distribution  of  the  random  variable  Xn,  conditional 
on  starting  the  chain  at  X0  = x. 

If  7r  is  a density  such  that 

7 r(x)  = / k(x\x')  7r(x')  dx' , 

J X 


(2.5) 
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then  7r  is  called  an  invariant  density  for  the  Markov  chain  $.  Consider  the  significance 
of  equation  (2.5).  Imagine  drawing  x'  from  it  and  then  making  a single  transition 


state  of  the  chain  was  drawn  from  n,  then  the  marginal  density  of  the  next  state  is 
also  7 r.  Consequently,  if  the  Markov  chain  $ is  started  by  taking  X0  ~ 7 r (which  is 
usually  impossible  in  the  MCMC  context),  then  $ is  simply  a sequence  of  dependent 
observations  from  n.  In  other  words,  the  Markov  chain  is  stationary. 

Abusing  notation  slightly,  let  7r  denote  the  probability  measure  associated  with 
the  density  7r  so  7 r(A)  = fA  7r(x)  dx.  The  Markov  chain  $ is  called  7r- irreducible 
if  for  every  x G X and  every  A with  7 r(A)  > 0,  there  exists  an  n G N such  that 
Pn(x,  A)  > 0.  In  words,  $ is  7r-irreducible  if  any  set  with  positive  7r  measure  is 
accessible  from  any  point  in  the  state  space.  We  now  describe  a toy  Gibbs  sampler 
that  will  be  used  for  illustration  several  times  throughout  this  chapter. 

Example  2.2.1  Let  Yi, . . . , Ym  be  iid  N(/i,  9)  and  let  the  prior  for  (p,  9)  be  propor- 
tional to  1/ \[9.  The  posterior  density  is  characterized  by 


as  m > 3 and  we  assume  this  throughout.  Using  the  Gibbs  sampler  to  make  draws 
from  (2.6)  requires  the  full  conditional  densities,  f(p\9,y)  and  f(9\p,y),  which  are 
as  follows 


x1  -»  x according  to  the  Markov  chain.  The  joint  density  of  (x1,  x)  induced  by 
this  recipe  is  exactly  the  integrand  in  (2.5).  Hence,  (2.5)  implies  that  if  the  current 


(2.6) 


where  y — (yi, . . . ,ym)T . It  is  easy  to  check  that  this  posterior  is  proper  as  long 


p\9,y  ~ N {y,9/m) 
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where  y is  the  sample  mean  and  s 2 = YKVi  ~ y)2-  (We  say  W ~ IG(a:, /?)  if  its 
density  is  proportional  to  w~('a+1^e~^^wI(w  > 0).)  Consider  the  Gibbs  sampler  (or 
data  augmentation  algorithm)  that  updates  9 then  y;  that  is,  if  we  let  ( 6 ',  y')  denote 
the  current  state  and  ( 9 , y)  denote  the  future  state,  the  transition  looks  like  (9' , y')  — > 
(9,  y')  — > ( 9 , y).  The  state  space  in  this  case  is  X = K+  x R and  the  Markov  transition 
density  is 

(2.7) 

In  other  words,  the  density  of  the  new  value  ( 9 , y)  given  the  current  state  ( 9 ',  y')  is 
k(9 , y\9',  y').  Simulating  a random  variable  from  this  density  can  be  done  sequentially 
by  first  taking  9 ~ f(9\y',y)  followed  by  y ~ f(y\9,y)  (the  usual  Gibbs  updating 
strategy).  Note  that,  as  is  almost  always  the  case,  Pn  does  not  have  a closed  form. 
Obviously  (2.6)  is  not  intractable  in  any  sense,  so  this  Gibbs  sampler  would  never 
actually  be  used.  However,  its  simplicity  makes  it  ideal  for  demonstrating  the  calcu- 
lation of  drift  and  minorization  conditions  (see  Subsection  2.3.1). 

By  construction,  the  posterior  density  is  invariant  for  the  Gibbs  Markov  chain; 
that  is, 

*0*,  %)  = / /"*(»,  #*!«', »')  «\v)  dp'  d6'  . (2.8) 

J R+ 

The  reader  is  invited  to  verify  that  (2.8)  follows  directly  from  (2.7).  Now,  if  A € B 
is  such  that  'n(A)  > 0,  then  A must  have  positive  Lebesgue  measure.  Thus,  for  any 
( 9',y ')  G X,  we  have 

P[(9',y'),A\=  [ k(9,y\9',y')d(9,y)>  0 
J A 

since  k is  strictly  positive  on  X.  Thus,  the  probability  of  moving  from  any  point  in 
the  state  space  to  any  set  with  positive  n measure  in  one  step  is  positive.  We  conclude 
that  this  Gibbs  Markov  chain  is  7r-irreducible  and  aperiodic.  (See  Subsection  2.7.1 
for  a general  definition  of  aperiodicity.)  We  will  return  to  this  example  later. 
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The  reason  MCMC  methods  are  useful  is  that,  under  simple  regularity  conditions, 
a Markov  chain  will  “converge”  to  its  invariant  distribution  no  matter  how  it  is 

started.  We  will  say  that  satisfies  assumption  (.4)  if  it: 

(i)  possesses  an  invariant  density  (or  probability  measure),  7r; 

(ii)  is  7r-irreducible; 

(iii)  is  aperiodic;  and 

(iv)  is  Harris  recurrent. 

Harris  recurrence  is  a technical  condition  that  is  usually  easy  to  verify  when  (i), 
(ii),  and  (iii)  are  satisfied.  (See  Subsection  2.7.1  for  a general  definition  of  Harris 
recurrence.)  Specific  results  concerning  the  Harris  recurrence  of  Gibbs  samplers  and 
Metropolis-Hastings  chains  can  be  found  in  Tierney  (1994,  Corollaries  1 and  2).  From 
a practical  point  of  view,  assumption  (4)  implies  that  the  starting  value  is  irrelevant 
and  that  the  chain  will  thoroughly  explore  the  state  space  as  the  number  of  iterations 
grows  large.  Under  assumption  (4),  for  every  x G X we  have 

||Fn(x,  •)  — 7r(-)  ||  4.0  as  n — >■  oo,  (2.9) 


where 

\\Pn(x,  •)  - tt(') ||  :=  sup  I Pn(x,  A)  - tt(4)| 

AeB 

is  the  total  variation  distance  between  the  probability  measures  Pn(x,-)  and  7 r(-). 
When  the  probability  measures  have  densities,  the  total  variation  distance  can  be 
expressed  as  one  half  the  integrated  absolute  difference  between  the  densities.  (For  a 
proof  of  this  as  well  as  the  monotonicity  in  (2.9)  look  at  subsection  2.7.2)  In  words, 
(2.9)  says  that  no  matter  what  the  starting  value,  the  random  variables  in  the  Markov 
chain  look  more  and  more  like  a random  variable  from  7r  as  n gets  large.  [Rosenthal 
(2001)  gives  an  accessible  proof  of  this  result.) 
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Assumption  (A)  also  guarantees  that  the  Ergodic  Theorem  holds.  Specifically,  if 
En\h\  f \h(x)  \ 7r (dx)  < oo,  then  for  any  starting  value 

n— 1 


-Y  h(Xi)^Enh 

ntJo 


as  n — ^ oo 


with  probability  1.  Thus,  letting  B E {0, 1, ...  } denote  the  burn-in, 


,.  B+n-l 

Kb  :=  ~ Y h(<X <)  (2-10) 

Tt  _ 
i—B 

is  a strongly  consistent  estimator  of  E^h.  In  this  notation,  our  original  two  questions 
become:  (Ql)  How  large  should  we  take  B ? and  (Q2)  How  large  an  n is  required? 

It  is  important  to  recognize  that  burn-in  is  not  strictly  necessary;  that  is,  using 
B — 0 in  (2.10)  still  results  in  a strongly  consistent  estimator  of  Enh.  Indeed,  one 
feature  of  the  RS  method  discussed  in  Section  2.5  is  that  no  burn-in  is  used;  in  fact, 
X0  is  drawn  from  a prescribed  starting  distribution.  In  contrast,  most  other  variance 
estimation  techniques  (e.g.,  batch  means  and  spectral  analysis)  are  more  effective 
when  the  Markov  chain  is  stationary  (Bratley  et  al.  1987,  p.  94).  Hence,  if  one 
of  these  techniques  is  to  be  employed,  then  it  may  be  necessary  to  use  a non-zero 
burn-in. 


2.2.2  Geometric  Convergence  to  7r  and  its  Connection  to  (Ql)  and  (Q2) 

Assumption  (M)  gets  us  the  convergence  in  (2.9),  but  does  not  tell  us  anything 
about  the  rate  of  convergence.  In  fact,  rigorous  answers  to  (Ql)  and  (Q2)  can  be 
developed  if  it  can  be  established  (through  drift  and  minorization)  that  the  conver- 
gence in  (2.9)  takes  place  at  a geometric  rate.  More  specifically,  suppose  that  the 
Markov  chain  $ satisfies  assumption  (M)  and  in  addition  is  geometrically  ergodic, 
that  is,  there  exists  a constant  0 < t < 1 and  a function  M : X i->-  R+  such  that  for 
any  x G X, 


||P”(z, •) -*(•)!! 


(2.11) 
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for  n = 1,2, (If  there  exists  a bounded  M satisfying  (2.11),  then  <3>  is  called 

uniformly  ergodic,  and  if  X has  a finite  number  of  elements,  then  M is  necessarily 
bounded.) 

Obviously,  if  (2.11)  holds  and  we  have  (or  can  bound)  M(-)  and  t , then  for  a given 
starting  value,  X0  = x,  we  can  calculate  exactly  how  many  iterations  are  necessary 
to  get  the  total  variation  distance  below  some  prespecified  value.  As  we  will  see  later, 
establishing  a drift  condition  and  an  associated  minorization  condition  allows  us  to 
use  the  results  of  Rosenthal  (1995a)  or  Roberts  and  Tweedie  (1999)  to  form  an  upper 
bound  on  the  right-hand  side  of  (2.11).  This  takes  care  of  (Ql). 

We  know  that  for  any  fixed  Be  {0, 1, . . . },  hn,B  is  a strongly  consistent  estimator 
of  E^h.  However,  without  a reliable  measure  of  its  accuracy,  this  estimator  is  not 
very  useful.  Suppose  that  the  following  CLT  holds 

V^(Vb-^)4n(0,^)  . (2.12) 

Then  given  a consistent  estimate  of  al,  we  could  get  an  asymptotic  standard  error 
for  hn,B-  Indeed,  Chan  and  Geyer  (1994)  show  that  if  $ satisfies  assumption  (>4),  is 

geometrically  ergodic,  and  E„\h\2+e  < oo  for  some  e > 0,  then  (2.12)  holds  with 

00 

a2h  = Var7r(h(X0))  + 2 £ Cov,(h(X0),  h(Xi)). 

i= 1 

The  subscript  ‘V’  means  that  the  variances  and  covariances  are  calculated  under 
stationarity;  i.e. , assuming  that  X0  ~ 7 r. 

Mykland  et  al.  (1995)  and  Robert  (1995)  show  that  when  the  CLT  holds  it  is 
possible  to  obtain  a consistent  estimate  of  a2  by  uncovering  regeneration  times ; i.e., 
times  at  which  the  Markov  chain  stochastically  restarts  itself.  This  technique  is 
called  regenerative  simulation  (RS)  and  is  closely  related  to  the  regenerative  method 
that  has  been  discussed  extensively  in  the  operations  research  literature  (see,  e.g., 
Glynn  and  Iglehart  1987).  In  contrast  to  other  methods  of  estimating  al  (see,  e.g., 
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Geyer  1992),  RS  does  not  require  $ to  be  stationary  or  reversible.  (While  standard, 
fixed  scan  Gibbs  samplers  like  the  ones  we  analyze  in  this  chapter  are  not  reversible, 
some  flavors  of  the  Gibbs  sampler  are  reversible  (Besag  et  al.  1995).)  The  use  of 
RS  for  the  purpose  of  estimating  is  described  in  Section  2.5  and  illustrated  in 
Subsection  2.6.4. 

Geometric  ergodicity  is  not  necessary  for  CLTs  (see,  e.g.,  Jarner  and  Roberts 
2000).  On  the  other  hand,  a CLT  may  fail  to  hold  even  in  very  simple  applications  of 
(subgeometric)  MCMC.  For  example,  Roberts  (1999)  shows  that  for  the  independence 
Metropolis  algorithm  of  Example  1,  a CLT  (for  all  functions  h that  are  bounded  away 
from  zero  at  oo)  will  not  hold  if  9 > 2.  In  the  next  Section,  we  define  drift  and 
minorization  and  describe  how  they  can  be  used  to  establish  (2.11),  and  to  formally 
address  (Ql)  and  (Q2). 

2.3  Geometric  Ergodicity  via  Drift  and  Minorization 

This  section  is  broken  up  into  three  subsections.  In  Subsection  2.3.1,  we  define 
drift  and  minorization  and  illustrate  the  required  calculations  with  our  toy  Gibbs 
sampler.  The  theoretical  connection  between  geometric  ergodicity  and  drift  and  mi- 
norization is  outlined  in  Subsections  2.3.2  and  2.3.3.  Subsection  2.3.2  begins  with  a 
description  of  how  a minorization  condition  can  be  used  to  split  the  Markov  transition 
density  into  a mixture  of  two  densities.  This  is  an  important  concept  for  understand- 
ing both  convergence  rate  bounds  and  regenerative  simulation.  Subsection  2.3.2  also 
contains  a recipe  for  using  this  mixture  representation  to  construct  two  copies  of  $ 
that  eventually  couple ; i.e.,  become  the  same  chain.  This  construction  leads  to  the 
coupling  inequality  which  is  the  key  result  for  deriving  convergence  rate  bounds.  In 
Subsection  2.3.3,  the  coupling  inequality  is  used  to  show  how  drift  and  minorization 
together  imply  that  the  Markov  chain  converges  at  a geometric  rate. 
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2.3.1  Definitions  and  Examples 

Throughout  this  section  we  assume  that  the  Markov  chain  $ satisfies  assumption 
(.4.).  We  say  a drift  condition  holds  if  for  some  function  V : X !->•  R+ , some  0 < A < 1, 
and  some  b < oo 


E[V(Xi+1)\Xi  = x]  < XV  (x)  +b  VxeT  (2.13) 

Note  that  this  expectation  is  with  respect  to  the  Markov  transition  kernel  and  not  n. 
It  is  often  useful  to  think  of  V as  a potential  energy  surface.  When  (2.13)  holds,  the 
chain  tends  to  “drift”  towards  states  of  lower  energy  in  expectation.  In  this  context, 
V is  called  an  energy  function.  Here  is  an  example  of  establishing  (2.13). 


Example  2.2.1  continued.  Assume  that  m > 5.  We  shall  establish  a drift  condi- 
tion using  the  function  V (//,  9)  = (y  — y)2.  The  form  of  the  Markov  transition  density 
(2.7)  implies  that  (i)  given  y' , (//,  9)  is  conditionally  independent  of  9';  and  (ii)  given 
9,  y is  conditionally  independent  of  y' . It  follows  that 

E{v(n,t>) \er,it\  = e{v(ii,8)\ii'}  = E{E\v(»,e)  |<UV}  = e{e[vm\6W}. 


Since  /x| 9,  y ~ N (y,  9/m),  the  innermost  expectation  yields 

E[V(v,  0)  |0]  = yf  |«]  = Var(M|«)  = 

lit 


Similarly, 


««-•>  - !l±SfiE 


Therefore, 


m — 3vr"  a'  ' m(m  — 3) 
< A V{y’,9')  + b 


for  b = s2/[m(m  — 3)]  and  any  A > —^3.  This  establishes  (2.13)  since  m > 5. 


36 


A minorization  condition  holds  if  for  some  probability  measure  Q(-)  on  B , some 
set  C for  which  n(C)  > 0,  and  some  e > 0 

P(x,-)>eQ(-)  VzGC.  (2.14) 

In  other  words,  for  all  x G C and  for  all  A e B,  P(x,A ) > eQ(A).  The  set  C is 
called  a small  set.  Note  that  plugging  X into  (2.14)  shows  that  e < 1. 

One  way  to  verify  that  $ is  geometrically  ergodic  is  to  show  that  4>  satisfies  both 
a drift  condition  and  an  associated  minorization  condition.  Specifically,  the  chain  is 
geometrically  ergodic  if  it  satisfies  (2.13)  and  (2.14)  with  C = {x  E X : V(x)  < d} 
and  any  d larger  than  26/(1  — A)  (Rosenthal  1995a).  We  now  demonstrate  how  to 
establish  a minorization  condition,  again  using  our  toy  Gibbs  sampler. 

Example  2.2.1  continued.  We  use  a technique  that  is  based  on  Rosenthal’s 
(1995a)  Lemma  6b.  Let  C = {(0,y)  ■ (y  — V)2  < d}  where  d > 0.  Suppose  that  we 
can  find  a density  q(0 , n)  on  X =M+xR  and  an  e > 0 such  that  whenever  (O',  //)  G C 

fW,y)f(^\e,y)>eq(6,n)  V(0,fi)eX.  (2.15) 

Let  Q(-)  be  the  probability  measure  associated  with  the  density  q.  Then  for  any  set 
A and  any  (O',  //)  G C we  have 

P[(0',n'),A}=  f f(0\n',y)f(n\0,y)  d(0,n)  > e [ q(fJL,9)  d(9,n)  = eQ(A), 

J A J A 

and  hence  (2.14)  is  established.  We  now  construct  a q(0,n)  and  an  e > 0 that  satisfy 
(2.15). 

Let  Cfj  = {n  : (n  — y )2  < d}  and  note  that  for  any  y!  G C M we  have 

fW,y)f(y\0,y)>  f(y\0,y)  inf  f(0\y,y). 

Recall  that  f(0\y,y)  is  just  an  IG  density.  In  fact,  g(0 ) :=  infMecv  f(d\y,y)  can  be 
written  in  closed  form.  (The  infimum  does  actually  depend  on  the  data  and  this  is 
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being  suppressed  in  the  notation.)  Let  IG(a,  0',w)  denote  the  value  of  the  IG(a,/3) 
density  evaluated  at  the  point  w > 0.  A calculation  similar  to  one  done  in  Rosenthal 
(1996)  yields 


g(0)  = inf  IG 


m 


1 s 


m , 

2 +T(M 


y)2-,o 


■ IG(^,f  + ^;^)  6 <9* 

\ IG  y ; #)  9 >9* 


where  9*  = md  [(m  — 1)  log(l  + md/s2)\  1 . Figure  2.4  shows  g(9)  for  the  case  where 
m = 5,  s2  = 10,  and  d = 22/5.  Now  put 


£ = [ [ g(9)f(y\9,y)dyd9  = f g{9)d9. 

J r+  Jr  Jr+ 

Then  (2.15)  is  satisfied  with  this  e and  the  density  q{9,g)  = e~1g(9)f(g\9,y).  Note 
that  e can  be  calculated  with  two  evaluations  of  the  incomplete  gamma  function. 
Since  our  minorization  holds  for  any  d > 0,  we  may  conclude  that  this  Gibbs  sam- 
pler is  geometrically  ergodic  as  long  as  m > 5.  We  will  return  to  this  example  in 
Section  2.4. 

Of  course,  (2.13)  and  (2.14)  may  be  difficult  (if  not  impossible)  to  establish  in 
realistic  settings.  Indeed,  there  is  no  guarantee  that  convergence  occurs  at  a geo- 
metric rate  even  in  simple  applications  of  MCMC  (recall  Example  2.1.1).  Exam- 
ples of  the  use  of  drift  and/or  minorization  for  analyzing  MCMC  algorithms  include 
Robert  (1995),  Rosenthal  (1995,  1996),  Roberts  and  Rosenthal  (1998b),  Hobert  and 
Geyer  (1998),  Jones  and  Hobert  (2001),  and  Hobert  (2001a)  who  considered  Gibbs 
samplers;  Meyn  and  Tweedie  (1994),  Mengersen  and  Tweedie  (1996),  Roberts  and 
Tweedie  (1996),  and  Jarner  and  Hansen  (2000)  who  worked  on  Metropolis-Hastings 
algorithms;  Roberts  and  Rosenthal  (1999a),  Roberts  and  Rosenthal  (1999b)  and  Mira 
and  Tierney  (2001)  who  examined  slice  sampler  Markov  chains. 

In  the  remainder  of  this  section,  we  explain  the  theoretical  connection  between 
geometric  ergodicity  and  drift  and  minorization.  To  accomplish  this  we  need  to 


A graphical  illustration  of  g(9)  :=  inf f(9  |/x,  y) 
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Figure  2.4:  Consider  the  family  of  densities  IG  y + y(/^  ~ y)j  as  A4  ranges 

over  the  set  C^;  that  is,  as  (/i  — y)2  ranges  between  0 and  d.  Suppose,  for  example, 
that  m — 5,  s2  — 10,  and  d = 22/5.  Then  the  shape  parameter  is  2 and  the  scale 
parameter  ranges  between  5 and  16.  Five  of  these  densities,  including  the  extremes, 
are  pictured  above.  The  point  of  intersection  of  the  two  extremes  is  9*  = 4.73.  It  is 
clear  that  one  of  the  extremes  is  always  the  minimum.  Specifically,  when  9 E (0,9*), 
IG(2, 16)  is  below  all  the  others,  while  for  values  above  9* , IG(2, 5)  is  the  smallest. 


introduce  a second  sufficient  condition  for  geometric  convergence  that  involves  a slight 
variation  on  (2.13).  Specifically,  suppose  that  for  some  function  V : A’  i — >■  [1,  oo),  some 
0 < A < 1,  and  some  b < oo  the  Markov  chain  $ satisfies 


E[V(Xi+l) \Xi  = *]  < XV (x)  + blc(x)  V a:  <=  X (2.16) 

where  C = {x  E X : V(x)  < d}  and  d is  any  number  larger  than  ~ 1-  ^ (2-16) 
holds  and  a minorization  condition  is  satisfied  on  C , then  $ is  geometrically  ergodic 
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(Roberts  and  Tweedie  1999).  We  show  in  Subsection  2.7.3  that  when  we  have  (2.13) 
we  will  also  have  a drift  of  the  form  given  in  (2.16). 

2.3.2  Minorization  and  Coupling 

The  concepts  of  splitting  and  coupling  are  explained  in  this  subsection.  We  begin 
with  splitting.  Recall  that  P(x,A)  = JAk(u\x)  du.  For  the  time  being,  we  require 
only  a minorization  condition.  Hence,  assume  that  we  have  established  (2.14)  by 
finding  a set  C , a density  q(- ) on  X,  and  an  e > 0 such  that  whenever  x G (7, 

k(u\x)  > eq(u ) V u G X.  (2-17) 

Note  that  for  each  fixed  x G C, 

. . . k(u\x)  — eqiu) 

r(u\x)  := 

1 — £ 

is  a density  in  u and  is  called  the  residual  density.  It  follows  that,  whenever  x G C, 
we  can  split  k(u\x)  into  a mixture  of  two  densities  as  follows 

k(u\x)  — eq(u)  + (1  — e)  r(u\x)  . (2-18) 

Whenever  Xj  e C , (2.18)  can  be  used  to  generate  Xi+1  sequentially  as  follows.  Given 
Xj  G C,  generate  Si  ~ Bernoulli(e).  If  Si  — 1,  then  draw  X;+i  ~ q(-),  else  draw 
Xi+1~r(.|Xi). 

This  mixture  representation  of  k(u\x)  allows  for  the  (joint)  construction  of  two 
Markov  chains,  = {X0,  Xi,  X2, . . . } and  <f>v  = {Y 0,  Y Y2,  • • ■ },  that,  marginally, 
are  identical  copies  of  $,  but  which  are  not  independent.  In  fact,  $3,  and  are  con- 
structed in  such  a way  that  they  eventually  become  the  same  Markov  chain;  that  is, 
they  eventually  couple.  What  makes  this  possible  is  the  fact  that  the  density  q in 
(2.18)  does  not  depend  on  x.  Here  are  the  details. 

Let  X0  = xQ  be  an  arbitrary,  fixed  starting  value  and  draw  Yq  from  the  invariant 
probability  distribution;  i.e.,  Y0  ~ 7r.  (Note  that  we  will  not  actually  have  to  simulate 
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from  7r).  The  construction  involves  two  different  methods  of  simulating  (Xj+i,  Yj+i) 
conditional  on  ( Xi,Y  i ),  and  which  of  the  two  is  used  depends  on  whether  or  not 
(Xi,Yi)  G C x C.  First,  if  (X, ,Yj)  £ C x C,  then  we  draw  Xi+i  ~ k(-\Xi) 
and  independently  draw  Y i+ 1 ~ k(-\Yi).  If,  on  the  other  hand,  ( Xj,Yj ) G C x C, 
then  we  use  (2.18)  as  follows.  We  draw  <5*  ~ Bernoulli(e).  If  Si  = 0,  then  we  draw 
Xi+i  ~ r(-\Xi)  and  independently  draw  Y i+i  ~ r(-|Yj).  But  if  5*  = 1,  then  we 
draw  Xi+i  = Y i+1  ~ q(-)  and  all  future  draws  are  made  in  such  a way  that  the 
two  chains  remain  equal.  (For  a more  detailed  and  more  general  explanation,  see 
Rosenthal  (1995a).) 

The  coupling  time,  T,  is  defined  to  be  the  (random)  time  at  which  coupling  occurs; 
that  is,  the  time  at  which  the  two  chains  become  the  same.  Lindvall  (1992,  p 24)  uses 
standard  renewal  theory  to  show  that  T is  finite  with  probability  1.  The  key  result 
that  is  used  to  bound  convergence  rates  of  (general  state  space)  Markov  chains  is 
the  so-called  coupling  inequality  which  is  now  derived.  Recall  that  the  total  variation 
distance  between  the  probability  measures  Pn(x0,  •)  and  7r(-)  is  defined  as 

\\Pn(x0i ')  — 7r(')ll  :=sup\Pn(x0,A)  -ir(A)\. 

AeB 

Consider  the  right  hand  side  and  note  that,  using  the  construction  above,  we  have 

| Pn(x0,A)-n(A)  | = | Pr(X„  G A)  - Pr(Yn  G A)  \ 

- |Pr(XnG  A,Xn  = Yn)  + Pr(Xne  A,Xn±Yn) 

- Pr( Yn  G A,  Xn  = Yn)  - Pr (Yn  G A,  Xn  / Yn)  \ 

= | Pr(X„  G A Xn  ± Yn)  - Pr(Y„  G A,  Xn  ± Yn)  \ 

< max  {Pr(Xn  G A,  Xn  ^ Yn),  Pr(Yn  Gd,Xn/  Yn)} 
Pr(X„  ± Yn) 

Pr(T  > n)  . 


< 
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Hence,  the  coupling  inequality: 

||P"(lo,')-irOII<Pr(T>n).  (2.19) 

The  importance  of  the  coupling  inequality  is  easily  demonstrated.  Firstly,  because 
T is  finite  the  convergence  in  (2.9)  is  immediate  (Lindvall  1992,  Theorem  11.2).  Sec- 
ondly, when  the  entire  state  space  is  small;  i.e.,  when  C = X,  the  coupling  inequality 
immediately  yields  a bound  on  the  total  variation  distance  to  stationarity.  To  see  this, 
note  that  when  C = X,  ( Xi,Yi ) is  always  in  CxC,  which  means  that  a Bernoulli(e-) 
is  drawn  at  every  step.  Thus,  T ~ Geometric^);  that  is,  Pr(T  = n)  = e(l  — e)n-1  for 
n G N.  It  follows  that  Pr(T  > n)  = (1  — e)n,  and  hence  we  have  the  following  result. 
Theorem  2.3.1.  (Meyn  and  Tweedie  1993,  p392)  Suppose  the  Markov  chain  <&  sat- 
isfies assumption  (A)  as  well  as  the  minorization  condition  (2.14)  with  C = X.  Then 

\\Pn{xo,  •)  — 7r(-) ||  < (1  — e)n- 

Observe  that  the  bound  does  not  depend  on  the  starting  value,  x0,  and  hence  $ 
is  uniformly  ergodic.  Uniform  ergodicity  is  not  common  in  MCMC,  but  there  are  a 
few  instances  (Robert  and  Casella  1999,  Chapter  9).  One  important  example  is  the 
independence  Metropolis  algorithm  with  fat  tail  proposals  which  we  now  discuss. 

Example  2.1.1  continued.  Consider  a general  independence  Metropolis  sampler 
with  target  density  k{x)  and  proposal  density  p(x).  Suppose  that  both  of  these 
densities  are  continuous  and  strictly  positive  on  X.  In  this  case,  the  Markov  transition 
kernel,  P,  does  not  have  a density  with  respect  to  Lebesgue  measure  (Tierney  1998), 
but  the  chain  does  satisfy  assumption  (M).  Mengersen  and  Tweedie  (1996)  show  that 
if  there  exists  a k > 0 such  that 

/ r „ 

< k for  all  x , 

P\x) 


(2.20) 
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then  (2.14)  holds  with  C - X and  e = k-1.  Thus,  by  Theorem  2.3.1  the  Markov 
chain  is  uniformly  ergodic  and 


Mengersen  and  Tweedie  (1996)  also  show  that  if  for  every  k > 0 there  is  a set  of 
positive  measure  where  (2.20)  fails  to  hold,  then  the  chain  converges  at  a subgeometric 
rate.  Thus,  there  is  an  “all  or  nothing”  aspect  to  the  independence  sampler.  For  the 


subgeometric. 

Note  that  the  existence  of  k satisfying  (2.20)  is  exactly  what  is  required  to  im- 
plement rejection  sampling  (Robert  and  Casella  1999,  p.49)  with  proposal  density 
p.  However,  unlike  the  rejection  sampler,  the  independence  sampler  can  be  imple- 
mented without  knowing  the  value  of  k.  (See  Tierney  (1994)  and  Caffo,  Booth  and 
Davison  (2001)  for  more  on  this.)  Liu  (1996)  gives  an  interesting  comparison  of  the 
independence  sampler  and  rejection  sampling. 

Unfortunately,  even  when  the  hypotheses  of  Theorem  2.3.1  hold,  it  is  often  the  case 
that  the  value  of  e is  too  small  for  Theorem  2.3.1  to  be  of  any  practical  value.  Indeed, 
there  is  typically  a trade-off  between  the  size  of  the  small  set  and  the  magnitude  of 
e.  When  C is  a proper  subset  of  X,  the  distribution  of  T is  quite  complicated.  This 
case  is  addressed  in  the  next  subsection. 

2.3.3  Connecting  Drift  and  Minorization  to  Geometric  Convergence 

We  now  explain  how  drift  and  minorization  can  be  used  to  establish  geometric 
convergence  to  the  invariant  distribution  when  the  set  C in  (2.14)  is  not  the  entire 
state  space.  We  do  not  intend  this  argument  to  be  rigorous.  We  are  striving  only  to 
convey  the  nature  of  the  connection.  For  a completely  rigorous  approach,  the  reader 


independence  sampler  of  Example  2.1.1  we  have  ir(x)/p(x)  = 9 1 exp{x(0  — 1)}. 
Consequently,  when  9 e (0, 1)  the  chain  is  uniformly  ergodic  and  if  9 > 1 it  is 
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should  consult  Lindvall  (1992),  Meyn  and  Tweedie  (1993),  Rosenthal  (1995a),  and 
Roberts  and  Tweedie  (1999). 

We  have  seen  that  if  the  whole  space  is  small,  then  the  number  of  steps  until  we 
successfully  couple  has  a geometric  distribution.  Suppose  now  that  (2.14)  holds  for 
some  set  C that  is  a proper  subset  of  X.  Then  each  time  we  reach  C,  we  can  draw 
a Bernoulli^)  and  if  we  get  a “success”  we  couple  and  we  can  apply  the  coupling 
inequality. 

Thus,  the  next  step  is  to  consider  how  long  it  takes  between  visits  to  C.  Let  tc 
denote  the  (random)  number  of  steps  it  takes  the  chain  to  return  to  the  set  C;  that 
is,  tq  = min{n  > 1 : Xn  G C}.  Obviously,  the  distribution  of  Tc  will  depend  on  the 
starting  value,  x0.  Suppose  we  could  show  that,  for  every  x0  G C,  Tc  has  a moment 
generating  function.  It  would  then  follow  that  the  time  to  a successful  coupling,  T, 
is  a geometric  sum  of  visits,  each  having  a “light  tailed”  distribution;  so  overall  T 
itself  would  have  a light  tail  and  hence  a moment  generating  function  (Roberts  and 
Tweedie  1999,  Theorem  2.1).  Thus,  there  would  exist  a j3  > 1 such  that  E((3T ) < oo 
and  from  the  coupling  inequality  we  would  have 

(3n\\Pn(x,  •)  - 7r ( * ) ||  < Pn  Pr(T  >n)<  E[(3tI{T  > n)j  ->  0 (2.21) 

as  n -»  oo  by  dominated  convergence.  Therefore,  we  would  be  able  to  conclude 
that  ||Pn(av)  — tt(-)||  = o(/3~n);  that  is,  the  convergence  to  stationarity  occurs  at  a 
geometric  rate.  (Some  might  refer  to  this  as  exponential  convergence,  but  see  Lindvall 
(1992,  p.30).] 

The  role  of  the  drift  condition  is  to  ensure  that  the  return  time,  Tc,  has  the 
required  tail  behavior.  Specifically,  suppose  now  that  (2.16)  holds  and  note  that  we 
may  rewrite  it  as 


AL(x)  :=  E[V(Xi+1) \X{  = x}~  V(x)  < -(1  - A)F(*)  + blc(x). 
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Therefore,  the  expected  change  in  the  value  of  V is  negative  when  x is  not  in  C;  and 
indeed,  not  only  does  V(x)  “drift”  in  to  the  set  where  V is  small  in  this  sense,  but 
it  does  so  in  such  a way  that  from  points  x with  larger  V values,  the  drift  is  faster. 
Conceptually,  this  suggests  that  once  the  chain  leaves  C it  should  return  quickly. 

Mathematically,  it  turns  out  that  V gives  the  required  tail  behavior  in  a very 
explicit  sense.  Indeed,  in  subsection  2.7.4,  we  prove  the  well  known  result  that  (2.16) 
implies  that  for  any  x0  G C 

E [A~TC]  < d + y (2.22) 

A 

(see  also  Meyn  and  Tweedie  1994,  Lund  and  Tweedie  1996).  Hence,  drift  covers  the 
tail  behavior  of  the  return  times  to  C and  minorization  ensures  that  only  a geometric 
number  of  those  times  are  needed;  and  together  they  lead  to  the  existence  of  a f)  > 1 
satisfying  (2.21).  In  the  next  section,  we  state  a theorem  of  Rosenthal  (1995a)  that 
gives  explicit  upper  bounds  on  the  distance  to  stationarity  in  terms  of  the  drift  and 
minorization  conditions. 


2.4  How  much  burn-in? 

Suppose  that  the  Markov  chain  $ satisfies  assumption  (.A).  Here  is  a slightly 
simplified  version  of  Rosenthal’s  (1995a)  result: 

Theorem  2.4.1.  (Rosenthal  1995a)  Suppose  that  $ satisfies  the  following  drift  con- 
dition. For  some  function  V : X i->-  R+,  some  0 < A < 1,  and  some  b < oo 

E[V{Xi+1)\Xi  = x]<\V{x)  + b (2.23) 

(Note  that  this  is  the  drift  condition  given  in  2.13.)  Let  C = {x  \ V(x)  < d} 
where  d is  any  number  larger  than  25/(1  — A).  Suppose  that  $ satisfies  the  following 
minorization  condition.  For  some  probability  measure  Q(-)  and  some  e > 0, 


P(x,  •)  > eQ(-)  V x £ C. 


(2.24) 
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Let  Xq  = Xo  and  define  two  constants  as  follows 


1 +d 

1 + 26  + Ad 


and  U — 1 + 2(A d + 6)  . 


Then  for  any  0 < r < 1 

iip”(*o, •)  - *oii  < (i  - *r  + ( J^)”  (i  + ^ + ^(^«))  • 

In  applying  this  result,  the  user  must  specify  the  values  of  d and  r.  It  makes  sense 
to  do  so  in  such  a way  that  < 1,  otherwise  the  bound  may  not  decrease  in  n. 
Furthermore,  in  our  experience,  slight  changes  in  d and  r can  lead  to  wildly  different 
results,  so  it  pays  to  experiment.  We  use  Theorem  2.4.1  in  a realistic  application  in 
Section  2.6.  Here  is  a simpler  example  of  its  use. 


Example  2.2.1  continued.  Suppose  that  m = 5 and  s 2 = 10.  Then  the  drift 
condition  established  in  Subsection  2.3.1  holds  with  6=1  and  any  A > 1/2.  If  we 
take  A = 1/2,  then  d can  be  any  number  larger  than  26/ (1  — A)  = 4.  If  we  take  d = 6, 
then  e « 0.35.  Then  taking  r — 0.05  and  starting  the  chain  with  /j,0  = y,  we  have 


II Pn  [(00,/*),  •]  - tt(-)||  < (0.9785)"  + 3 (0.9641)"  . 


Hence,  after  220  iterations,  the  total  variation  distance  is  less  than  0.01. 

Cowles  and  Rosenthal  (1998)  and  Roberts  and  Tweedie  (1999)  have  recently  pro- 
vided alternatives  to  Rosenthal’s  bound.  However,  there  is  often  little  practical  dif- 
ference between  Rosenthal’s  bound  and  the  alternatives.  Cowles  and  Rosenthal’s 
(1998)  and  Roberts  and  Tweedie’s  (1999)  results  are  given  in  Section  2.7  along  with 
an  explanation  of  exactly  why  we  do  not  use  them  in  this  chapter.  A rigorous  answer 
to  (Q2)  is  formulated  in  the  next  section. 
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2.5  Monte  Carlo  Standard  Errors 

In  this  section,  we  discuss  the  use  of  regenerative  simulation  (RS)  for  calculating 
standard  errors  of  ergodic  averages.  In  general,  the  regenerative  method  involves 
breaking  simulation  output  up  into  iid  pieces  that  can  be  analyzed  using  standard 
results  for  iid  data.  A complete  development  of  this  subject  can  be  found  in  Ripley 
(1987,  Chapter  6)  or  Bratley  et  al.  (1987,  Chapter  3).  Mykland  et  al.  (1995),  Robert 
(1995),  and  Robert  and  Casella  (1999,  Chapter  8)  discuss  RS  in  the  MCMC  context. 
Regenerative  methods  of  analyzing  simulation-based  output  have  a rich  history  in  the 
operations  research  literature  (see,  e.g.,  Crane  and  Iglehart,  1975;  Glynn,  1985;  and 
Glynn  and  Iglehart,  1987,  1993). 

This  section  consists  of  three  subsections.  A generalization  of  (2.14)  is  introduced 
in  Subsection  2.5.1.  This  more  general  minorization  condition  can  be  used  in  the 
same  manner  as  (2.14)  to  represent  the  Markov  transition  density  as  a mixture  of 
two  densities.  It  is  this  mixture  that  is  used  to  break  the  Markov  chain  up  into  iid 
pieces.  The  details  are  given  in  Subsection  2.5.2.  In  Subsection  2.5.3,  we  explain 
exactly  how  RS  is  used  to  calculate  Monte  Carlo  standard  errors.  The  method  of 
batch  means , which  is  an  alternative  to  RS,  is  also  briefly  discussed. 

2.5.1  A More  General  Minorization  Condition 

A generalization  of  (2.14)  is  as  follows.  For  some  function  s : X ^ R+  such  that 
E*8  > 0 and  some  probability  measure  Q(-)  on  B, 

P(x,-)>s(x)Q{-)  MxtX.  (2.25) 

Clearly,  (2.14)  is  the  special  case  of  (2.25)  where  s(x)  = el(x  £ C ).  Note  that 
plugging  X into  (2.25)  shows  that  s(x)  < 1.  Mykland  et  al.  (1995)  show  that  it  is  often 
easy  to  establish  (2.25)  for  Gibbs  samplers  and  Metropolis-Hastings  algorithms.  (Our 
reasons  for  introducing  this  generalization  will  be  spelled  out  later  in  this  section.) 
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Recall  that  P(x,A ) = JAk(u\x)  du.  We  can  establish  (2.25)  by  finding  a non- 
negative s(-)  and  a density  q(-)  on  X such  that 

k(u\x)  > s(x)  q(u)  Vx,u€X.  (2.26) 

We  now  establish  (2.26)  for  our  toy  Gibbs  sampler  using  a technique  described  in 
Mykland  et  al.  (1995). 

Example  2.2.1  continued.  We  will  construct  a density  q(9,y)  on  X = K+  x R 
and  a function  s(6',y')  such  that 

k(6,y\0',y')  > s(d',y')q(6,y) 

for  all  (0,/i),  (#',//)  6 X.  To  this  end,  let  (6,  y)  be  a “distinguished  point”  in  X and 
let  D be  a set  in  X . Note  that 


k(0,  y\9',  /*')  = f(9\y',y)f{y\6:y) 

lW,y) 


> inf 


,/W.y). 

fW,y) 


(wen  [f(9\y,y)m 
for  all  (9,y),  ( 9',y ')  G X.  Let 


fW,v)f(p\o>v) 

f(B\y,y)f(y\9,y)i[(9,y)  e D\ 


= [ f{B\y,y)  f(y\0,y)d{9,y)  . 
J D 


Now  simply  take  q(9,  y)  = e 1f(9\y,  y)f(y\9,  y)  I [(0,  y)  G D\  and  take 

,n,  l\  ■ c IfW’VY 

s(9  ,y)=e  inf  — — — r- 
(9,n)dD  f(9\y,y) 


Note  that  starting  the  chain  with  a regeneration  corresponds  to  starting  the  sampler 
at  y.  As  a specific  example,  take  the  distinguished  point  to  be  (9,y)  = (1,17)  and 
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D = [di,  d2]  x R where  0 < d\  < d2  < oo.  Then 


inf 


1 + 


m(y  - n') 


f\ 2 


f m(y  - //)2  ] 

eXP{ 251 ) 


and  calculating  e again  boils  down  to  evaluating  the  incomplete  gamma  function.  In 
practice,  the  distinguished  point  is  often  set  at  a preliminary  estimate  of  the  mean  of 
the  stationary  distribution  and  D is  centered  at  that  point;  see  Subsection  2.6.4. 


2.5.2  The  Split  Chain 

The  analogue  of  (2.18)  for  our  new  minorization  condition  is 

k(u\x)  = s(x)  q(u)  + (1  — s(a?))  r(w|*)  (2.27) 

where  the  residual  density  r(u\x)  is  now  defined  as 

. . . k(u\x)  — s(x)  q(u) 
r(u\x ) := t r • 

The  mixture  (2.27)  can  be  used  to  generate  Xi+i  sequentially  as  follows.  Given 
Xi  — x,  generate  8i  ~ Bernoulli(s(x)).  If  8,  = 1,  then  draw  Xi+i  ~ q(-),  else  draw 
Xi+1  ~ r (•  |aj).  What  we  are  actually  doing  here  is  simulating  the  so-called  split  chain 

# = {(Xo,8oUXu8l),{X9,8a),...}  , 

which  has  state  space  X x {0, 1}  (Athreya  and  Ney,  1978;  Nummelin,  1978,  1984). 
The  times  at  which  8,  = 1 are  regeneration  times  when  probabilistically  restarts 
itself.  More  specifically,  suppose  we  start  <£'  with  X0  ~ q(-).  Then  each  time  that 
8i  = 1,  Xi+i  ~ q(-)  and  we  are,  in  effect,  starting  over  again.  Moreover,  the  tours  in 
between  regeneration  times  are  iid.  Observe  that  as  we  make  s(a;)  larger,  we  expect 
the  average  tour  length  to  decrease. 

In  order  to  use  RS  to  get  standard  errors,  we  must  be  able  to  simulate 
The  most  straightforward  way  to  do  this  is  as  described  above.  That  is,  draw 
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Si  ~ Bernoulli(s(a;))  and  then  draw  from  either  q or  r depending  on  the  value  of 
Si.  This  is  problematic,  however,  because  drawing  from  r(-|a;)  can  be  quite  difficult 
in  practice  (see,  e.g.,  Robert  1995).  (Note  that  (2.18)  was  used  only  for  the  theoreti- 
cal argument  leading  to  the  coupling  inequality,  and  hence  the  issue  of  drawing  from 
r never  came  up  in  Subsection  2.3.2.) 

Fortunately,  Mykland  et  al.  (1995)  provide  a simple  and  clever  way  of  avoiding  r 
altogether.  If  we  write  the  transition  as  Xi  — > Si  — > Xi+i,  we  need  to  generate  from 
(Si,  Xi+i)\Xi.  Above,  we  suggested  doing  this  by  first  drawing  from  Si\Xi  and  then 
drawing  from  Xi+1|<5j,  Xi,  which,  if  Si  = 0,  entails  simulation  from  r(-\Xi).  Mykland 
et  al.  (1995)  note  that  simulating  from  the  residual  density  can  be  avoided  by  first 
drawing  from  Xi+i\Xi  (in  the  usual  way)  and  then  drawing  from  Si\Xi,  Xj+i.  A 
straightforward  calculation  shows  that 

Pr(dj  = l|Xi,Xi+1)  = ^^.  (2.28) 

which  is  easy  to  calculate. 

There  is  actually  another  important  advantage  to  drawing  from  (5,-,  Xi+1)|Xj  in 
this  way.  The  method  introduced  by  Mykland  et  al.  (1995)  of  establishing  (2.26)  (for 
Gibbs  samplers)  entails  first  showing  that  k(u\x)  > s'(x)q'(u),  where  q1  is  an  unnor- 
malized density,  and  then  letting  q = q1/  f q1  and  s = s'  f q' . Note,  however,  that 
drawing  from  Si\Xi,  Xi+i  requires  only  the  product  s(-X’j)  g(Xi+1).  Consequently, 
there  is  no  need  to  calculate  the  normalizing  constant! 

We  are  now  in  a position  to  explain  why  we  introduced  (2.25)  in  the  first  place.  It 
is  possible  to  use  the  original  minorization  condition  (2.14)  for  RS.  However,  in  our 
experience,  minorization  conditions  of  the  form  (2.25)  typically  lead  to  many  more 
regenerations  than  those  of  the  form  (2.14).  If  a minorization  condition  is  such  that 
regenerations  happen  very  infrequently,  it  may  take  an  inordinate  amount  of  time 
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to  observe  enough  regenerations  so  that  the  approximations  are  reasonable.  We  now 
explain  how  the  ideas  in  Subsections  2.5.2  and  2.5.1  are  applied  in  RS. 

2.5.3  Regenerative  Simulation 

Assume  that  the  Markov  chain  $ satisfies  assumption  (.A).  We  know  that  En\h\  < 
oo  implies  that  hn  :=  hn$  — n-1  H^i)  is  a strongly  consistent  estimator  of 
E^h.  Assume  further  that  $ is  geometrically  ergodic  and  that  £,7r|/i|2+e  < oo  for  some 
e > 0,  so  that  we  have  the  following  CLT 

^-^)4N(0,^).  (2.29) 

Estimation  of  al  is  difficult  because  the  Xi  s constituting  hn  are  not  independent.  By 
using  the  split  chain,  we  can  rewrite  hn  as  a function  of  iid  bivariate  random  vectors. 
This  trick  allows  us  to  use  standard  techniques  from  iid  theory  to  estimate  a\.  Here 
are  the  details. 

Suppose  that  we  have  established  (2.26)  so  we  can  simulate  $'  using  the  technique 
of  Mykland  et  al.  (1995).  Let  r0  < n < ■ ■ • be  the  (random)  regeneration  times;  i.e., 
rt+i  = min{*  > rt  : = 1}.  Assume  that  To  = 0 so  the  chain  is  started  with  a 

regeneration;  that  is,  X0  ~ q(-).  (Mykland  et  al.  (1995)  show  that  starting  with  a 
regeneration  is  quite  easy  for  standard  MCMC  samplers.)  Also  assume  that  $ is  run 
for  a fixed  number,  R , of  tours;  that  is,  the  simulation  is  stopped  the  Rth  time  that  a 
Si  = 1.  Thus,  the  total  length  of  the  simulation,  N,  is  random.  Let  Nt  be  the  length 
of  the  tth  tour;  that  is,  Nt  — rt  — rt_i  and  define 

Tt~  i 

s,=  Y. 

j-Tt- 1 

for  t = 1, . . . , R.  As  Mykland  et  al.  (1995)  point  out,  the  ( Nt , St)  pairs  are  iid  since 
each  is  based  on  a different  tour.  Assume  that  Nt  and  St  have  finite  second  moments. 
Let  N be  the  average  tour  length;  that  is,  N — R~x  XltLi  M and,  analogously,  let 
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S — R 1 St-  By  the  Strong  Law  of  Large  Numbers 


r _ Et=i  st  _ S_  _ J_  . x 

R yR  Nt  N N " J 

L^dt— 1 1 * j=0 

with  probability  1 as  i?  ->  oo.  Furthermore,  by  the  CLT 


N- 1 


Enh 


(2.30) 


v^(ji-w)4N(»,«a 


(2.31) 


However,  it  is  an  open  question  as  to  whether  the  variance  of  the  asymptotic  distri- 
butions in  (2.29)  and  (2.31)  coincide.  Now,  the  cr \ from  (2.31)  may  be  consistently 
estimated  (Glynn  and  Iglehart  1993)  with  the  standard  (delta  method)  variance  es- 
timation formula  for  a ratio  estimator  (Kendall  and  Stuart  1963), 


(Ju  - 


Ef=i  (St  - hRNty 
RN2 


(2.32) 


Given  this  estimate,  we  can  form  an  asymptotic  confidence  interval  for  E^h  using  the 
formula 


K±z 


1/2 


where  2 denotes  the  appropriate  standard  Normal  quantile.  Mykland  et  al.  (1995) 
recommend  using  (2.32)  only  when  the  estimated  coefficient  of  variation  (CV)  of  N 
is  less  than  0.01.  We  illustrate  the  use  of  the  RS  method  in  a realistic  situation  in 
Section  2.6. 

We  note  that  there  is  little  discussion  in  the  literature  about  how  to  actually 
establish  that  Nt  and  St  have  finite  second  moments.  One  might  imagine  that  the 
geometric  ergodicity  of  $ combined  with  the  fact  that  En\h\2+t  < oo  would  imply 
that  these  moments  are  finite.  In  our  applications,  we  have  simply  assumed  Nt  and 
St  have  finite  second  moments. 

Despite  its  attractive  theoretical  properties,  there  seem  to  be  few  substantive 
applications  of  RS  in  the  MCMC  literature.  Four  examples  are  Geyer  and  Thompson 
(1995),  who  use  regeneration  to  calculate  Monte  Carlo  standard  errors  in  the  context 
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of  their  simulated  tempering  algorithm,  Gilks,  Roberts  and  Sahu  (1998),  who  employ 
regeneration  to  create  adaptive  MCMC  algorithms,  Guihenneuc-Jouyaux  and  Robert 
(1998),  who  use  renewal  theory  as  an  approach  to  assessing  convergence,  and  Levine 
and  Casella  (2000)  who  employ  RS  in  the  context  of  the  Markov  chain  Monte  Carlo 
EM  algorithm. 

We  now  briefly  describe  the  method  of  batch  means,  which  is  an  alternative 
method  of  calculating  Monte  Carlo  standard  errors.  This  technique  is  a special  case 
of  a methodology  called  standardized  time  series,  and  is  the  method  used  by  the 
popular  software  package  BUGS.  (For  more  details,  see  Ripley  (1987,  Chapter  6), 
Bratley  et  al.  (1987,  Chapter  3)  or  Geyer  (1992).) 

Consider  estimating  Enh  with  hn  and  suppose  it  is  known  that  a CLT  of  the  form 
(2.29)  holds.  The  run  of  the  sampler  is  broken  up  into  batches  of  equal  size  that  are 
assumed  to  be  approximately  independent.  Specifically,  suppose  the  algorithm  is  run 
for  a total  of  n = aft  iterations  where  ft  is  large  enough  so  that  the  quantities 

kb- 1 

= i E h(x<) 

° t=(fc— 1)6 

are  approximately  independently  N \E^h,^^  for  k = 1 The  batch  means 

estimate  of  a l is 

^ = ^rr  E (s‘  - ■ (2-33> 

k= 1 

Bratley  et  al.  (1987,  Chapter  3)  recommend  forming  an  approximate  confidence  in- 
terval for  Enh  using 

- / a2h\lf 2 

hn±ta~1  [ri) 

where  fta_i  is  the  appropriate  quantile  from  the  t-distribution  with  a — 1 degrees  of 
freedom.  The  estimator  (2.33)  is  not  a consistent  estimator  of  c2h  when  6 is  fixed 
and  will  generally  lead  to  wider  confidence  intervals  than  if  a consistent  estimator 
were  used  (Glynn  and  Iglehart  1990).  Furthermore,  Geyer  (1992)  points  out  that  the 
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batch  means  method  will  be  effective  only  if  the  size  of  the  batches,  b , is  much  larger 
than  the  mixing  time  for  the  chain.  In  most  practical  applications,  the  only  way  to 
get  a handle  on  the  mixing  time  is  by  analyzing  the  empirical  autocorrelations. 

We  view  the  batch  means  method  as  an  ad  hoc  version  of  the  RS  method.  In 
particular,  both  methods  break  the  run  of  the  sampler  up  into  pieces.  The  difference 
is  that  in  the  RS  method,  this  is  done  in  a way  that  guarantees  that  the  pieces  are 
independent.  Consequently,  it  is  not  necessary  to  analyze  the  empirical  autocorre- 
lations before  applying  RS.  (Of  course,  constructing  a useful  minorization  condition 
is  usually  much  harder  than  examining  empirical  autocorrelation  plots.)  Standard 
errors  produced  using  RS  are  compared  with  those  produced  using  batch  means  in 
the  next  section,  which  contains  a realistic  application  of  all  the  techniques  described 
so  far  in  this  chapter. 


2.6  A Realistic  Application 

The  methods  described  in  Sections  2.3,  2.4  and  2.5  are  now  used  to  construct 
rigorous  answers  to  (Ql)  and  (Q2)  for  the  block  Gibbs  sampler  for  model  (A4).  This 
section  has  three  subsections.  In  Subsection  2.6.1  we  discuss  the  data  set  that  will  be 
analyzed,  the  priors  that  will  be  considered,  and  the  “posterior  quantities  of  interest” 
that  will  be  estimated.  A detailed  description  of  the  block  Gibbs  sampler  is  given  in 
Subsection  2.6.2.  Honest  answers  to  (Ql)  and  (Q2)  are  provided  in  Subsections  2.6.3 
and  2.6.4,  respectively. 

2.6.1  The  Data,  Priors,  and  Posterior  Quantities  of  Interest 

Lyles  et  al.  (1997)  described  an  experiment  in  which  laminators  working  at  a boat 
manufacturing  plant  were  measured  for  styrene  exposure.  Specifically,  13  workers 
were  randomly  selected  from  a group  within  the  plant  and  each  one’s  styrene  exposure 
was  measured  on  3 separate  occasions.  The  data  are  summarized  in  Table  2.6. 
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Table  2.6:  Styrene  Exposure  Data 


Worker 

1 

2 

3 

4 

5 

6 

7 

Vi 

3.302 

4.587 

5.052 

5.089 

4.498 

5.186 

4.915 

Worker 

8 

9 

10 

11 

12 

13 

Vi 

4.876 

5.262 

5.009 

5.602 

4.336 

4.813 

Mt  = Km 

= 39 

V 

II 

2-jj: 

=1  yij = 

4.809 

SSTR  = 3 

Eili  (ft  - 

-y?  = 

11.430 

SSE 

^-^13 

J2j=l(Vij 

- Vi? 

= 14.711 

Lyles  et  al.  (1997)  performed  a frequentist  analysis  of  these  data  using  a random  ef- 
fects model.  We  consider  a Bayesian  analysis  using  model  (M)  from  Subsection  2.1.3. 
Specifying  the  prior  is  tantamount  to  choosing  values  for  the  six  hyperparameters: 
ai,  h,  b 2,  Mo  and  A0.  While  we  are  actually  interested  in  the  performance  of  the 

block  Gibbs  sampler  for  all  possible  choices  of  the  hyperparameters,  we  will  settle  for 
studying  the  six  different  hyperparameter  settings  that  are  shown  in  Table  2.7. 

Table  2.7:  Hyperparameter  Settings 


Setting 

Oi 

bi 

62 

Mo 

Ao 

c 

1 

60.176 

7.7573 

3.1237 

1.7674 

4.809 

1 

1 

2 

601.76 

77.573 

31.237 

17.674 

4.809 

0.1 

0.1 

3 

0.1 

0.1 

0.1 

0.1 

4.809 

0.1 

- 

4 

1 

5 

1 

1 

3.6 

1 

- 

5 

0.6 

1 

120 

16 

4.809 

1 

- 

6 

4 

80 

40 

100 

4 

1 

- 

The  first  two  settings  in  Table  2.7  represent  priors  that  are  consistent  with  the 
data.  They  are  based  on  the  ANOVA  estimates  of  m,  A"1  and  A^ 1 as  we  now  describe. 
Define  MSE  = SSE/(MT  - K)  and  MSTR  = SSTR/(AT  - 1).  The  ANOVA  estimates 
of  n,  Aj1  and  A# 1 are  y,  MSE,  and  (MSTR— MSE)/ra,  respectively  (Searle  et  al.  1992, 
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Chapter  3).  We  set  the  prior  expectations  for  Xg  and  Ae  equal  to  the  obvious  values 

7.76  and 


m 


E 6,  MSTR  - MSE 


E ^ = | = M k = 177 


and  then  solved  for  aq,  b\,  a2,  and  b2  by  setting  the  prior  variances  both  equal  to 
c E {0.1, 1}.  As  for  y,  the  prior  mean  was  set  equal  to  y and  we  considered  a 
couple  of  different  prior  variances.  Setting  #3  is  a so-called  “diffuse”  prior;  that  is, 
all  of  the  prior  variances  are  large.  As  we  will  see  below,  settings  #4  - #6  were 
selected  to  illustrate  certain  points  about  how  our  answer  to  (Ql)  depends  upon  the 
hyperparameters.  We  now  describe  the  posterior  quantities  of  interest. 

Recall  from  (2.4)  that  the  posterior  density  is  characterized  by 


tt(0,  y,  Ae,  \e\y)  oc  f{y\G,  Xe)f(0\y,  Xg)  f (Xe)  f (y)  f (Xg)  . 


Due  to  the  conjugacy  in  model  (A4), 

9(A»,  Ae)  :=  f f f(y\e,  A ,)/(<%,  i»  dn 

Jr  Jrk 

has  a closed  form.  However,  the  normalizing  constant  for  tt(9,  y,  Xe,  Xg\y)  given  by 

cn  I I gi^Xg,  Ae)  dXe  dXg 
J R+  J R+ 

does  not  have  an  analytic  solution.  We  will  take  the  posterior  quantities  of  interest 
to  be  the  posterior  expectations  of  A g and  Ae;  that  is, 

E(Xe\y)  = / J*eg(\e,\e)  dXe  dXg  ^ = / fXeg{Xe,Xe)  dXe  dXg 

Cjr  Cjr 

Each  of  these  is  a ratio  of  intractable  two-dimensional  integrals,  and  can  therefore  be 
computed  (more  or  less)  exactly  using  numerical  integration.  On  the  other  hand,  if 
we  were  interested  in  the  posterior  expectation  of  a complex  function  of  (0,  y , Ae,  A#), 
then  the  dimension  of  one  of  the  intractable  integrals  would  likely  be  much  higher  than 
two  and  could  be  as  high  as  K + 3.  Fortunately,  the  RS  procedure  that  we  develop 
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and  use  here  can  be  used  to  get  a standard  error  for  any  posterior  expectation  (as 
long  as  a CLT  holds).  Thus,  there  is  really  no  loss  of  generality  in  looking  at  such 
simple  posterior  expectations.  In  the  next  subsection,  we  describe  the  block  Gibbs 
sampler  that  will  be  applied. 


2.6.2  The  Block  Gibbs  Sampler 

The  two  obvious  fixed  scan  Gibbs  samplers  that  can  be  employed  to  sample  from 
the  posterior  7r  in  (2.4)  are  (i)  the  ordinary  “one-at-a-time”  version  that  updates  each 
component  sequentially,  and  (ii)  a block  version  in  which  all  of  the  Normal  compo- 
nents ($i, . . . ,9k,  y)  are  updated  simultaneously.  Given  the  work  of  Liu,  Wong  and 
Kong  (1994)  and  Roberts  and  Sahu  (1997),  it  seems  likely  that  the  block  Gibbs  sam- 
pler will  mix  faster  than  the  ordinary  Gibbs  sampler.  Generally  speaking,  blocking 
is  effective  when  the  constituent  parts  of  the  block  are  highly  correlated.  Also,  pro- 
gramming the  block  Gibbs  sampler  is  only  slightly  more  difficult  than  programming 
the  ordinary  Gibbs  sampler.  Henceforth,  we  confine  our  attention  to  the  block  Gibbs 
sampler  which  is  formally  defined  as  follows. 

Let  £ = (#!,...,  9k,  y)T  and  A = (A#,  \e)T  and  define 

K K 

Vi(0  = 5^(0*  - A*)2  and  K2(0  = - y{)2  and  SSE  = Y'SVij  ~ Vi)2 

»=1  i= 1 i,j 

where  y{  = m~l  Y^jLi  Vij • The  full  conditionals  for  the  variance  components  are 


Af?|£,  A e,y  ~ Gamma 


K 

j+au 


Vi(i) 

2 


+ h 


and 


Ae|£,A e,y  ~ Gamma 


Mr  , K2(0  + SSE  , L 

— — h 02) h 02 


Hobert  and  Geyer  (1998)  show  that  £|A,  y ~ N($*,  S)  and  give  the  specific  forms  of 
«*=«*(A,»)  andS  = S(A,v). 
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One  cycle  of  the  block  Gibbs  sampler  consists  of  updating  A#,  Ae,  and  £ in  some 
order.  Note  that  A<?  and  Ae  are  conditionally  independent  given  £,  and  hence  the  order 
in  which  they  are  updated  is  irrelevant.  Thus,  in  effect,  the  block  Gibbs  sampler  is  a 
data  augmentation  algorithm  (Tanner  and  Wong  1987),  the  two  components  being  £ 
and  A.  If  we  update  A first,  a one-step  transition  looks  like  (A',  £')  — > (A,  £')  — >■  (A,  £), 
and  the  corresponding  Markov  transition  density  is 

*(A,  «|A',  C)  = y)f(K\e, »)/(« |A,  y).  (2.34) 

It  is  easy  to  show  that  this  Markov  chain  satisfies  assumption  (.A).  In  Chapter  3 we 
establish  drift  and  minorization  conditions  for  this  block  Gibbs  sampler. 

2.6.3  Honest  Burn-in 

For  each  of  the  six  hyperparameter  settings  in  Table  2.7,  we  used  Theorem  2.4.1 
in  conjunction  with  the  drift  and  minorization  conditions  in  Chapter  3 to  find  a value 
of  n such  that 

ll^n[(Ao,4o),-]-7r(-)||<0.01  (2.35) 

where  (A0,  £0)  is  the  starting  value.  In  each  case,  we  used  £0  = (yx, . . . , y#,  y)T . As 
is  clear  from  (2.34),  a starting  value  for  A is  not  required.  (This  convergence  criterion 
(<  0.01)  has  become  fairly  standard  (Rosenthal  1996,  Cowles  and  Rosenthal  1998, 
Roberts  and  Rosenthal  1999a).) 

Table  2.8  contains  the  results.  For  example,  under  the  first  hyperparameter  set- 
ting, after  3 x 108  iterations  of  the  block  Gibbs  sampler,  the  total  variation  distance 
to  stationarity  is  at  most  0.00429.  It  takes  about  2 minutes  to  run  1 million  iterations 
of  our  block  Gibbs  sampler  on  a fast  PC.  Consequently,  the  only  settings  that  result 
in  unmanageable  burn-in’s  are  settings  #3  and  #4.  Recall  that  setting  #3  is  the 
diffuse  prior.  The  result  for  setting  #4  is  typical  of  those  settings  in  which  /x0  is  far 
from  y and  we  included  setting  #4  specifically  to  demonstrate  this  problem.  Settings 
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#5  and  #6  are  the  result  of  “playing  around”  with  the  hyperparameters.  Indeed, 
with  #5  we  were  trying  to  find  a setting  that  would  give  a very  short  burn-in.  With 
#6,  we  were  trying  to  see  if  it  was  possible  to  find  a setting  in  which  |/x0  — y\  is  not 
small,  but  the  resulting  burn-in  is  still  manageable.  It  is  interesting  to  note  that  the 
value  of  A0  does  not  appear  in  the  drift  condition  or  in  the  minorization  condition 
and  hence  its  value  has  no  bearing  on  the  results  in  Table  2.8. 

Table  2.8:  Total  Variation  Bounds  for  the  Styrene  Exposure  Data 


Setting 

A 

b 

d 

r 

£ 

Iterations 

Bound 

1 

0.4810 

2.9872 

12.311 

0.0107 

4.00  x 10~12 

3 x 108 

0.00429 

2 

0.2112 

3.5925 

10.708 

0.0426 

2.74  x 10~6 

10,000 

0.00997 

3 

0.0504 

2.5097 

11.212 

0.0158 

5.39  x 10"14 

2 x 1012 

0.00324 

4 

0.4265 

25.380 

93.20 

0.0059 

1.06  x 10“15 

7 x 1017 

0.00638 

5 

0.1279 

1.6620 

4.8113 

0.08550 

4.48  x 10"3 

4,600 

0.00921 

6 

0.1450 

22.737 

61.283 

0.0275 

1.11  x IQ"4 

95,000 

0.00515 

There  are  three  possible  reasons  why  our  results  suggest  that  such  a long  burn- 
in  is  necessary  for  settings  #3  and  #4:  (i)  The  results  merely  reflect  the  fact  that 
the  block  Gibbs  sampler  mixes  very  slowly  under  these  hyperparameter  settings,  (ii) 
Theorem  2.4.1  happens  to  be  very  conservative  in  these  particular  cases,  or  (iii) 
The  drift  and  minorization  conditions  are  not  tight  for  these  settings.  The  results 
in  the  next  subsection  suggest  that  the  real  reason  is  probably  (ii)  or  (iii)  or  some 
combination  of  the  two. 


2.6.4  Honest  Standard  Errors 

The  drift  and  minorization  conditions  established  in  Chapter  3 show  that  our 
block  Gibbs  sampler  is  geometrically  ergodic  (see  also  Hobert  and  Geyer  1998).  Fur- 
thermore, it  is  easy  to  show  that  E.n{Xp9\y)  and  E7r(X1g\y)  are  both  finite  for  any 
p > 0.  Thus,  according  to  the  results  of  Chan  and  Geyer  (1994),  there  are  CLTs  for 
the  Monte  Carlo  estimators  of  our  posterior  quantities  of  interest. 
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A minorization  condition  of  the  form  (2.26)  is  constructed  for  our  block  Gibbs 
sampler  in  Chapter  3.  In  order  to  use  this  minorization  condition  for  RS,  we  had 
to  choose  the  distinguished  point  £ = (9i, . . . , Ok,  y)T  £ Rir+1  and  the  d^s  that 
determine  the  set  D — [di,  d2]  x [d3,d4]  C R^_ . These  choices  were  made  as  follows. 
The  block  Gibbs  sampler  was  run  for  an  initial  10,000  iterations  from  the  starting 
value  £0  = (yx, . . . , yK,  y)T . Let  0f\  ..  .,9$,  X^  and  denote  the  estimates 
of  the  posterior  expectations  of  the  corresponding  parameters  based  on  these  initial 
10,000  iterations.  For  the  distinguished  point,  we  set  £ = \ . . . , 9^\  n^)T.  As  for 

D,  we  set  [di,d2]  = X^  ± 1.1s  where  s is  the  (usual)  sample  standard  deviation  of 
the  10,000  values  of  Xg,  and  the  interval  [<i3 , <i4]  was  constructed  similarly. 

For  each  of  the  six  hyperparameter  settings,  we  performed  RS  as  described  in 
Section  2.5.  Specifically,  we  started  with  a regeneration;  that  is,  Xo  ?(A,£),  and 
then  ran  the  chain  for  as  many  regenerations  as  we  needed  to  get  the  Monte  Carlo 
error  down  to  a reasonable  level.  (Due  to  the  form  of  ^(A,  £),  taking  X0  ?(*>£) 
is  equivalent  to  starting  the  Gibbs  chain  at  £.)  Table  2.9  contains  the  results.  For 
example,  we  are  (approximately)  95%  confident  that  the  true  value  of  E(Xg\y)  under 
the  first  prior  is  in  the  interval  (7.753,  7.765).  (In  each  case,  the  estimated  CV  of  N 
was  less  than  0.01.)  Reported  in  Table  2.10  are  the  total  number  of  regenerations 
(. R ) for  which  the  chain  was  run  upon  which  the  variance  estimates  in  Table  2.9  are 
based  and  the  mean  number  of  iterations  per  regeneration. 

It  is  important  to  recognize  that  we  did  not  use  any  burn-in  to  obtain  the  results 
in  Table  2.9.  Indeed,  as  noted  by  Ripley  (1987,  Chapter  6),  Bratley  et  al.  (1987, 
Chapter  3)  and  Mykland  et  al.  (1995),  one  of  the  best  features  of  the  RS  method  is 
that  burn-in  is  simply  not  an  issue.  On  the  other  hand,  you  could  consider  the  initial 
10,000  iterations  used  to  construct  the  distinguished  point  and  the  set  D as  a form 
of  burn-in. 
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Table  2.9:  Point  Estimates  and  Asymptotic  95%  Confidence  Intervals  for  E(Xg\y) 
and  E(Xe\y) 


Setting 

Parameter 

Estimate 

<j2 

95%  Cl 

1 

x$ 

7.759 

0.2002 

(7.753,  7.765) 

Xe 

1.779 

0.0435 

(1.776,  1.782) 

2 

Xg 

7.758 

0.0305 

(7.755,  7.761) 

Xe 

1.769 

0.0227 

(1.766,  1.772) 

3 

Xg 

7.363 

7.9734 

(7.348,  7.378) 

Xe 

1.793 

0.0161 

(1.792,  1.794) 

4 

Xg 

0.958 

0.0251 

(0.955,  0.961) 

Xe 

1.756 

0.0453 

(1.752,  1.760) 

5 

Xg 

2.437 

0.3036 

(2.426,  2.448) 

Xe 

5.699 

0.0538 

(5.694,  5.704) 

6 

Xg 

0.118 

0.0003 

(0.1176,  0.1184) 

Xe 

0.498 

0.0012 

(0.497,  0.499) 

Table  2.10:  Regenerative  Behavior  of  the  Block  Gibbs  Sampler 


Setting 

Number  of 
regenerations 

Mean  number 
of  iter /regen 

1 

25000 

5.68 

2 

12000 

3.39 

3 

150000 

24.4 

4 

10000 

7.43 

5 

10000 

5.04 

6 

6000 

4.55 

Note  that  the  chains  corresponding  to  settings  #3  and  #4  were  run  for  about 
3.7  million  and  75,000  iterations,  respectively,  and  this  was  enough  to  get  the  Monte 
Carlo  error  down  to  a reasonable  level.  Hence  it  appears  that  these  chains  mix  much 
faster  than  is  suggested  by  the  numbers  in  Table  2.8.  (This  is  especially  true  for  the 
chain  corresponding  to  setting  #4.)  Thus,  the  fact  that  our  results  suggest  such  long 
burn-in’s  for  settings  #3  and  #4  is  probably  due  to  either  the  conservativeness  of 
Theorem  2.4.1  or  the  “looseness”  of  the  drift  and  minorization  conditions  in  Chapter 


3. 


61 


Based  on  Table  2.10,  it  seems  that  the  chain  associated  with  setting  #3  is  by 
far  the  slowest  mixing  of  the  6.  Recall  that  setting  #3  is  the  diffuse  prior.  This 
is  consistent  with  the  findings  of  Natarajan  and  McCulloch  (1998)  whose  empirical 
results  suggest  that  the  mixing  rate  of  the  Gibbs  sampler  becomes  slower  as  the  priors 
become  more  diffuse  (but  see  van  Dyk  and  Meng  2001). 

Table  2.11:  Batch  Means  Estimates  and  Asymptotic  Confidence  Intervals 


Setting 

Parameter 

Estimate 

ba2 

95%  Cl 

Batch  Size 

Iterations 

1 

A<? 

7.756 

0.9959 

(7.751,  7.761) 

4733 

141990 

Ae 

1.778 

0.1920 

(1.776,  1.780) 

2 

A# 

7.757 

0.0966 

(7.754,  7.760) 

1359 

40770 

Ae 

1.770 

0.0658 

(1.767,  1.773) 

3 

A# 

7.352 

179.07 

(7.338,  7.367) 

122000 

3660000 

Ae 

1.793 

0.3817 

(1.792,  1.794) 

4 

A e 

0.957 

0.2027 

(0.954,  0.960) 

2476 

74280 

Ae 

1.760 

0.0019 

(1.756,  1.764) 

5 

A 9 

2.435 

1.4942 

(2.424,  2.446) 

1680 

50400 

K 

5.702 

0.1850 

(5.698,  5.706) 

6 

A e 

0.119 

0.0014 

(0.114,  0.124) 

911 

27330 

K 

0.498 

0.0045 

(0.497,  0.499) 

For  the  sake  of  comparison,  we  also  calculated  approximate  confidence  intervals 
for  the  posterior  quantities  of  interest  using  the  method  of  batch  means.  To  make 
the  comparison  with  RS  fair,  we  used  a burn-in  of  10,000  iterations  and  ran  the  block 
Gibbs  sampler  (from  the  same  starting  point)  for  roughly  the  same  overall  number 
of  iterations.  When  the  length  of  the  run  is  fixed,  Schmeiser  (1982)  has  shown  that, 
no  matter  how  long  it  is,  somewhere  between  10  and  30  batches  is  reasonable.  An 
examination  of  the  empirical  autocorrelation  function  indicated  that,  in  each  case, 
using  30  batches  resulted  in  sufficiently  large  batch  sizes.  Table  2.11  contains  the 
batch  means  results.  Note  that  the  confidence  intervals  in  Tables  2.9  and  2.11  are 
nearly  identical.  The  variance  estimate  reported  in  Table  2.11  is  (2.33). 
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2.6.5  Minorization  for  the  Block  Gibbs  Sampler 

In  this  section,  we  construct  a minorization  condition  of  the  form  (2.26)  for  the 
block  Gibbs  sampler  introduced  in  Subsection  2.6.2.  We  begin  by  fixing  a distin- 
guished point  £ = (9i, . . . , 9k,  fj)T  £ MK+1  and  let  D = [di,d2]  x [^3,^4]  C . For 
notational  convenience,  let  V(  — Vi(£)  and  V = Vi(£)  for  i — 1,2.  (Note  that  Vi  and 
V2  are  defined  in  Subsection  2.6.2.)  Then  we  have 


/(A|£\y)/(£|A,y)>s(£0<z(A,£) 


where  q is  a density  on  D x R^+1  given  by 


q(\,£)  = -f(\\ly)f(£\\y)  I(\e  D) 


for  c — fD  /(A|£,  y)  dX  and 

8{$!)  = cinf 
xeD 


/(A|*,y). 

Gamma  (y  + ai,bi  + \V{\  A g)  Gamma  (^  + a2,b2  + ^V2]  Ae) 
Gamma  ^y  + ai,&i  + |Vi;  A^  Gamma  + a2,b2  + |Vi;  A e^j 

where  b2  = b2  + SSE/2.  Now  straightforward  calculations  reveal  that 


= c inf 
xeD 


«(*')  = c 

where 


2b!  + Vj 
L 2b\  + Vi\ 


V+°i  Toi./ 


2 b'2  + V2 
2b'2  + V2 . 


-+02 


exp{|(U-K)  + | (Vi-Vi)} 


9e 


d3  if  V2  > V' 


d!  if  Vi  > V{ 

and  ge  = < 

d2  if  Vi  < VI  I d4  if  V2  < V' 

Finally,  for  the  transition  (A',  £')  —>  6 — > (A,  £),  we  have  as  long  as  A E D 

Pr[i  = 1|(A',0.  (A, €)]  = exp  - A„)(U  - Vj)  + \(ge  - \e)(Vi  - V2')}  . 

(2.36) 

Observe  that  the  value  of  the  normalizing  constant  c was  not  required  for  this  calcu- 


lation. 
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2.7  Some  Theoretical  Details 

Throughout  this  chapter  we  have  focused  on  explaining  the  connection  between 
geometric  ergodicity  and  obtaining  rigorous  answers  to  (Ql)  and  (Q2).  At  several 
steps  along  the  way  we  have  made  simplifying  assumptions  for  the  sake  of  clarity.  In 
this  section  we  attempt  to  tie  up  many  of  these  loose  ends  by  filling  in  some  of  the 
details  about  several  earlier  claims. 

The  rest  of  this  section  is  organized  as  follows.  In  Subsection  2.7.1  we  introduce 
some  basic  Markov  chain  theory  from  a more  formal  point  of  view  than  that  of 
Subsection  2.2.1.  In  particular,  we  spell  out  assumption  {A)  in  some  detail  and  fix 
some  notation  that  we  will  use  throughout  the  rest  of  this  section.  We  then  consider 
some  properties  of  the  total  variation  norm  in  Subsection  2.7.2.  In  Subsection  2.7.3 
we  establish  a lemma  that  is  useful  for  converting  between  the  two  types  of  drift 
conditions  that  we  introduced  in  this  chapter.  Next,  in  Subsection  2.7.4  we  show 
that  a drift  and  minorization  condition  ensure  a moment  generating  function  for  the 
hitting  time  of  the  small  set.  Section  2.7.5  contains  a few  theorems  that  will  allow 
the  construction  of  total  variation  bounds  given  drift  and  minorization  conditions. 

2.7.1  Assumption  (A) 

Nummelin  (1984)  and  Meyn  and  Tweedie  (1993)  both  give  a thorough  develop- 
ment of  general  state  space  Markov  chain  theory.  Accounts  focusing  on  issues  relevant 
to  MCMC  include  Tierney  (1994),  Tierney  (1995),  and  Robert  and  Casella  (1999). 

Let  X C and  B = B(X)  be  a countably  generated  cr-algebra.  Suppose  4>  = 
(*n  , n = 0, 1, . . .)  is  a discrete  time,  time  homogeneous  Markov  chain  with  state  space 
X.  This  chain  evolves  according  to  a transition  kernel , P,  where  P : X x 8 — > [0, 1] 
such  that 

1.  P(x,  •)  is  a probability  measure  for  any  fixed  x E X\ 

2.  P(-,  A)  is  measurable  for  any  fixed  A € B . 


64 


We  allow  P to  act  to  the  right  on  functions  and  to  the  left  on  measures.  That  is,  if 
v is  a probability  measure  and  / is  a real-valued  ^-measurable  function 


Following  Meyn  and  Tweedie  (1993,  p.  67)  we  define  the  n-step  transition  probability 
kernel  inductively.  Let  x £ X and  A £ B.  Also,  let  SX(A)  be  the  Dirac  measure,  that 
is,  SX(A)  = 1 when  x £ A and  0 otherwise.  We  set  P°(x,  A)  = 5X(A)  and  for  n > 1 


If  X is  discrete  then  the  transitions  are  governed  by  a transition  matrix  P = 
(pij)  such  that  pij  > 0 for  all  i,j  and  whose  rows  sum  to  1.  In  this  case,  Pn  can 
be  obtained  by  matrix  multiplication.  When  $ lives  on  a general  state  space  the 
Lebesgue  decomposition  can  be  used  to  show  that  Pn  generally  has  an  associated 
density  (Meyn  and  Tweedie  1993,  p 107).  Unfortunately,  Pn  and  its  density  are 
only  rarely  available  in  closed  form.  However,  standard  MCMC  samplers  do  have  an 
associated  one-step  Markov  transition  density , k(x\x'),  that  we  can  write  down  in 
closed  form.  For  example,  as  we  observed  in  Subsection  2.2.1,  the  Markov  transition 
density  for  the  Gibbs  sampler  is  just  the  product  of  the  full  conditionals. 

A (j-finite  measure  7r  is  invariant  for  the  kernel  P if  7 r(A)  = ( nP)(A ) for  all  A £ B. 
If  7r  is  a probability  measure  and  X0  ~ 7r  then  Xn  ~ 7r  for  every  n and  $ is  said 
to  be  stationary  in  distribution.  MCMC  algorithms  are  typically  created  in  such  a 
way  that  the  target  distribution  is  the  invariant  distribution  for  the  induced  Markov 


for  all  x £ X and  all  A £ B.  Also,  put 


chain. 
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Suppose  X is  discrete  and  define  rx  = min{n  > 1 : Xn  = x}.  That  is,  tx  is  the 
first  time  that  x is  visited.  Let  x and  y be  any  two  elements  of  X.  In  this  case,  we 
say  that  x communicates  with  y if 

Pr(ry  < oo|X0  = *)  > 0 and  Pr^  < oo|X0  = y)  > 0 . 

In  this  case,  we  write  x ++  y.  It  is  easy  to  show  that  “o”  is  an  equivalence  relation 
and  hence  the  equivalence  classes 

C(x)  :={y  x O y}, 

with  x G C(x),  cover  X.  When  C(x)  = X for  some  x E X we  say  that  $ is 
irreducible. 

When  X is  uncountable  we  often  have  Pr(ry  < oo|X0  = x)  uniformly  equal  to  0. 
Thus  it  is  obvious  that  we  need  to  extend  our  notion  of  irreducibility.  Given  a measure 
ip,  the  Markov  chain  $ with  transition  kernel  P is  said  to  be  ip-irreducible  if  for  every 
A € B such  that  <p(A)  > 0 there  is  an  n such  that  Pn(x,  A)  > 0 for  all  x G X.  When  <f> 
is  (^-irreducible  there  exists  a unique  (up  to  constant  multiples)  maximal  irreducibility 
measure  ip  on  B such  that  (i)  <f>  is  V'-irreducible;  (fi)  if  £ is  any  other  irreducibility 
measure  then  £ -C  ip',  (iii)  if  ip{A)  = 0 then  ip({x  : Pr(r^  < oo|Xo  = x)  > 0})  = 0. 

Let  gcd  denote  the  greatest  common  divisor.  When  X is  discrete  then  the  period 
of  a state  x G X is 

d(x)  = gcd{n  > 1 | Pn(x,  {a:})  > 0}. 

Every  state  that  communicates  with  x has  the  same  period  as  x.  If  all  states  com- 
municate and  d — 1 then  the  chain  is  aperiodic. 

Now  suppose  $ is  ^“irreducible  and  lives  on  a general  state  space  X.  Further 
suppose  that  there  exists  a sequence  of  nonempty  disjoint  sets  D0 , Di,...,  Dd_i  for 
some  d > 2.  Then  $ is  periodic  with  period  d if  for  alii  = 0, 1, . . . , d — 1 and  all 
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x G D{  , 

P{x,  Dj ) = 1 for  j — i + 1 (mode?)  . 

Otherwise,  $ is  aperiodic.  The  existence  of  a minorization  condition  will  ensure  that 
$ is  aperiodic. 

For  many  Gibbs  samplers,  aperiodicity  and  irreducibility  follow  from  the  fact 
that  P(x,  A)  > 0 if  A has  positive  measure  with  respect  to  the  dominating  measure. 
That  is,  it  is  possible  to  reach  any  set  of  positive  measure  from  any  starting  point 
in  one  step.  Similarly,  irreducibility  follows  for  a Metropolis-Hastings  chain  if  it  is 
based  on  a proposal  density  q(y\x)  > 0 for  every  ( x,y ) while  aperiodicity  follows  if 
P(x,{x})  > 0 for  some  x G X. 

A -^-irreducible  chain  $ is  recurrent  if,  for  each  A with  ip  (A)  > 0, 

Pr(Xn  G A i.o.  |X0  = a;)  > 0 for  all  x G X 

Pr(Xn  G A i.o.  |X0  = *)  = 1 for  -^-almost  all  x G X 

In  this  case,  $ admits  a unique  (up  to  constant  multiples)  invariant  measure  n such 
that  7 r is  also  a maximal  irreducibility  measure  for  3>.  The  chain  is  Harris  recurrent 
if  Pr(X„  G A i.o.  |X0  = x)  = 1 for  all  x G X.  Finally,  if  the  invariant  measure  n 
has  finite  mass  then  the  chain  is  said  to  be  positive  recurrent  or  positive  Harris.  The 
transition  kernel  for  an  irreducible  Gibbs  sampler  is  usually  absolutely  continuous 
with  respect  to  the  invariant  measure  and  this  is  enough  to  ensure  Harris  recurrence. 
On  the  other  hand,  7r-irreducibility  immediately  implies  that  the  Metropolis-Hastings 
algorithm  is  Harris  recurrent. 

2.7.2  Total  Variation  Distance  and  Ergodicity 

In  Section  2.1  and  Subsection  2.2.1  we  introduced  two  different  expressions  for 
total  variation  distance.  In  this  subsection  we  show  that  they  are  equivalent.  We  also 
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give  a third  representation  of  the  total  variation  norm.  We  will  then  give  a proof  of 
the  monotonicity  in  the  ergodicity  result  (2.9). 

Let  //  and  v be  two  probability  measures  defined  on  the  same  measurable  space 
(X,B).  The  total  variation  distance  of  //  and  u is  defined  as 

||//-  u\\  = sup|/i(i4)  - v(A)\  . (2.37) 

AeB 


Theorem  2.7.1.  If  a is  any  a-finite  measure  which  dominates  //  and  v then 

\\fi  ~v\\  = lri  “ r2l  da  (2-38) 

where  r\  = dyi/da  and  r 2 = du/ da  are  Radon-Nikodym  derivatives. 

Proof.  (Billingsley  1968,  p.  224)  Let  <p  = r\  - r2.  Then  we  have 

(p(x)a(dx)  — j [IA(x)  + lAc(x)]4>(x)a(dx) 

so  that 

| fiAzMzMdx) 

and  hence 

\n(A)  - u(A)\  = IA{x)(p{x)a<ydx) 

= \ |jy  IA{x)<t>{x)a(dx) 

< ~ ^ IA(x)\<t>(x)\a(dx)  + J IAc(x)\<l>(x)\a(dx) 

= \<Kx)\a(dx) 

- \f  M*)  ~r2(x)\a{dx)  . 


+ 


/ 


IAc(x)(j)(x)a(dx) 


J I a*  ( x)(p(x)a(dx ) 


The  supremum  is  achieved  with  A = {x  : r i(x)  — r2(x)  > 0}. 


□ 


Corollary  2.7.1. 
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\\V-  v\\  = sup  - 
l/l<i  2 


J f(x)n(dx)  - j f(x)v(dx) 


Proof.  (Billingsley  1968,  p.  224) 


(2.39) 


\\f  f {x)n(dx)  — J f(x)u(dx) 


= \ |y  f(x)[ri(x)  - r2(x))a(dx) 

- \ J lri(x)  “ r2(x)\a(dx) 

= \\V  “HI- 


The  supremum  is  achieved  when  A = {x  \ ri(x)  — r2(x)  > 0}  and  / = I a — 1 a*-  □ 

Recall  the  fundamental  ergodicity  result  in  (2.9).  That  is,  for  a Markov  chain 
satisfying  assumption  (^4)  and  having  transition  kernel  P and  invariant  measure  tx 


\\Pn(x,  •)  — 7r(-)  ||  l 0 as  n — > oo,  (2.40) 

Note  that  if  we  put  (2.39)  and  (2.40)  together  then  for  every  function  \h\  < 1 (and 
hence  every  bounded  function)  we  have 


| Pnh  — nh\  — y 0 n — > oo  . 


which  will  yield  convergence  in  distribution.  Alternatively,  we  have  directly  from 
(2.40)  that 

\Pn(x,  A)  — 7t(A)|  0 

for  every  ^-continuity  set  A which  is  equivalent  to  convergence  in  distribution  (The- 
orem 25.8  Billingsley  1995).  Now  for  the  monotonicity  in  (2.40). 

Theorem  2.7.2.  Let  A denote  the  initial  distribution.  If  tt  is  invariant  for  P then 

J A (dx)Pn(x,  •)  — 7 r 

is  non-increasing  in  n. 
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Proof.  (Meyn  and  Tweedie  1993,  p.  323)  Recall  Corollary  2.7.1  and  that  since  ir  is 
assumed  invariant  we  have  7r  = ttP.  Then 


J X (dx)Pn(x,  •)  — 7T 

= sup  I f A (dx)Pn(x,dy)f(y)  - f ir(dy)f(y) 

— sup  f X(dx)Pn~1(x,  du) 

l/l<i  \J 

< sup  [ X(dx)Pn~1(x,du)f(u)  — f ic(du)f(u) 

where  the  inequality  follows  since  whenever  |/|  < 1 we  have  \Pf\  <1.  □ 

2.7.3  Converting  Between  Drift  Conditions 

In  Section  2.3  we  introduced  two  types  of  drift  conditions  that  may  be  used  to 
establish  geometric  ergodicity  of  a Markov  chain  4>.  The  following  lemma  is  useful 
for  converting  between  these  drift  conditions.  Recall  from  Subsection  2.7.1  that  for 
a real-valued  B-  measurable  function  / we  have 

Pf(x)  = E[f(Xn+1) \Xn  = x\. 

Lemma  2.7.1.  Let  $ be  a Markov  chain  satisfying  assumption  (A).  Suppose  there 
exists  a function  V : X i— >■  [1,  oo),  a 0 < A < 1,  and  a b < oo  such  that 

PV{x)  <XV{x)  + b (2.41) 


J P(u,  dy)f(y)  - J -ir(du)  J P(u,dy)f(y) 


for  all  x € X . Then  for  any  a > 0 

PV{x)  < X'V{x)  + blc 


where  X1  = (a  + A) /(a  + 1)  and 


C = 


|x  £ X : V[x)  < 


(a  + 1)6  1 

a(l-A')i  ' 


(2.42) 
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Proof.  Set  AV  = PV  - V and  /?  = (!-  A )/(o  + 1).  Then 


or,  equivalently, 


PF(x)  < [l-(a  + l)/9]V(x)  + 6 


AV(x)  < —fiV(x)  — a/3V(x)  + b 


for  all  x € X.  If  x £ C then 

v(x)  > (a+1)6  > (°  + 1>i  = ± 
' ' o(l  - A')  a(l  - A)  o0 

Now  write  V(x)  = ^ + s(x)  where  s(x)  > 0.  Then 


AV(x)  < -0V(x)  - a/3 

= —/3V(x)  — a/3s(x) 
< -PV{x). 


*(*) 


+ 6 


If,  on  the  other  hand,  x € C then 


AF (x)  < ~PV (x)  — a/3V (x)  + b < —pV (x)  + b 


Now  putting  these  together  gives 


PV(x)  < (1  - P)V(x)  + blc  = A'V(x)  + blc 


□ 


2.7.4  Moment  Generating  Functions  of  Hitting  Times 

Suppose  that  there  is  an  energy  function  V \ X ^ oo)  satisfying  the  drift 
condition 


PV  < XV  + blc 


(2.43) 
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where  0 < A < 1 and  b < oo.  Moreover,  we  assume  that  there  is  an  e > 0 and  a 
probability  measure  Q such  that 

P(x,-)>eQ(-)  VxeC.  (2.44) 

The  hitting  time  of  the  set  C is  defined  as  tc  = min{n  > 1 : Xn  G C}.  In  this 
section  we  provide  a proof  of  the  fact  that  the  moment  generating  function  of  tc 
exists,  i.e.,  Py(/3TC)  < oo  for  some  /3  > 1 when  (2.43)  and  (2.44)  hold  (Meyn  and 
Tweedie  1993,  Lund  and  Tweedie  1996).  More  specifically,  we  will  show  that 

Ey(/3TC)  < /3(\m  + b)  for  y G C 

Ey(/3TC)  < V(y)  for  y e Cc  . (2.45) 

where  m = sup yeCV(y)  and  1 < j3  < 1/A. 

For  A G B and  x G X let  Px(ta  = k)  = Pr(r,4  = k\X0  = *).  Note  that 
Px{ta  = 1)  = P(x,A).  Also,  as  a consequence  of  the  Markov  property  we  have  for 
each  fixed  integer  k > 1 

Px[ta  — k)  = [ P{  x,  dy)Py{rA  = k-  1) 

J A* 

= f P{x,dyx)  [ P(yi,dy2)---  /*  P(yn_2,dyn_1)P(yn_1,A) 

J Ac  J Ac  J Ac 

This  is  Proposition  3.4.5  of  Meyn  and  Tweedie  (1993). 
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+ 


Now  for  (2.45).  When  y G Cc  we  can  use  (2.43)  to  obtain 

V(y)  > \~XPV (y)  = A-1  V (z)P(y,  dz) 

= A"1/  V(z)P(y,dz)  + \-1  [ V(z)P(y,dz) 

Jcc  Jc 

> A-1  [ V(z)P(y,  dz)  + A_1Py(rc  = 1) 

Jcc 

> A-1  J |a"x  J V(u)P{z,du)^  P{y,dz)  + \-1Py(TC  = 1) 

= A-2  [ \f  V (u)P(z,  du)  + f V(u)P(z,du)  P(y,dz) 

Jcc  Ucc  Jc 

A ~lPy(TC  = 1) 

> A”2  [ [ V(u)P{z,du)P{y,dz)  + X-2Py(Tc  = 2)  + \-1Py(TC  = l) 

Jcc  Jcc 

Continuing  in  this  fashion  we  obtain  for  all  k > 1 

k 

V(y)  >^2\~nPy(TC  = n)  . 

n= 1 

Now  let  k — » oo  on  both  sides  to  get,  for  all  1 < /3  < A-1, 

00 

V(y)  > Y,  = n)  = Et( \-'°)  > E,(F°) . 

n=l 

Now  suppose  that  y G C and  consider  the  transition  y -4  x. 

Ey(0T°)  = E[E,(/r°\x)]  = [ Et(^\x)P(y,dx) 

J X 

= [ Ey(t3Tc\x)P(y,dx)+  [ Ey(PT°\x)P(y,dx) 

Jc  Jcc 

= ( PP(v,dx)+  [ PEm(F°)P{y,dx) 

Jc  Jc° 

< [ /3V(x)P(y,dx)+  [ fJV(x)P(y,dx) 

Jc  Jcc 

= P [ V(x)P{y1dx) 

J x 

= PPV(y ) 


< /3(A m + b) 
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where  the  first  inequality  follows  from  our  previous  work  and  that  V > 1.  The  second 
inequality  follows  from  the  drift  condition  (2.43). 

2.7.5  Alternative  Total  Variation  Bounds 

Cowles  and  Rosenthal  (1998)  and  Roberts  and  Tweedie  (1999)  have  recently  pro- 
vided alternatives  to  Rosenthal’s  (1995a)  bound.  The  first  theorem  we  state  is  a direct 
generalization  of  Rosenthal’s  (1995a)  bound  given  by  Cowles  and  Rosenthal  (1998). 
We  then  present  two  results  of  Roberts  and  Tweedie  (1999).  The  first  theorem  of 
Roberts  and  Tweedie  that  we  state  holds  for  any  number  of  iterations,  i.e.,  as  long  as 
n>  1,  while  the  second  result  holds  only  for  “large”  n.  However,  for  sufficiently  large 
n the  second  result  is  a strict  improvement  over  the  first.  Throughout  we  assume  that 
the  Markov  chain  $ satisfies  assumption  (.4).  Here  are  slightly  simplified  versions  of 
Cowles  and  Rosenthal’s  (1998)  and  Roberts  and  Tweedie’s  (1999)  theorems.  (Note 
that  we  have  included  the  correction  described  in  Roberts  and  Tweedie  (2001).) 

Theorem  2.7.3.  (Cowles  and  Rosenthal  1998)  Suppose  that  $ satisfies  the  following 
drift  condition.  For  some  V : T ki  M+,  some  0 < A < 1 and  b < oo 


Let  C = {x  : V(x)  < d}  where  d is  any  number  larger  than  26/(1  — A).  Suppose 
that  <f>  satisfies  the  following  minorization  condition.  For  some  probability  measure 
Q and  some  e > 0, 


E[V(Xi+i)\Xi  = x]<  XV  (x)  + b VxeX. 


(2.46) 


P(x,  •)  > eQ{-)  V x e C . 


Let  X0  = x0,  M > 0 and  define  some  constants  as  follows 


a 


j 1 + 2 Mb  + MXd 
= 1 + Md 


A = l + 2(A  Md  + Mb) 


and 
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Then  for  any  0 < r < 1 we  have 


|P>oO  - *(-)||  < (1  - e)rn  + C„  (aA)~l  (a~^A')n 


If  furthermore  it  is  known  that  V(x)  > 1 for  all  x 6 X,  then  it  suffices  that  d > 
— l,  and  these  values  may  be  improved  slightly  to 


a 1 = A + 


M6  + (1-A)(l-M) 

l + ¥(d-l) 


A = M(\d  + b)  + (1  - M) ; 


Co  = 


M 


1 - A 


+ V(x0)  + (1  - M) . 


Theorem  2.7.4.  (Roberts  and  Tweedie  1999)  Suppose  that  4>  satisfies  the  following 
drift  condition.  For  some  V : X [1,  oo),  some  0 < A < 1,  and  some  b < oo 

E[V(Xi+1)\Xi  = x]  < \V{x)  + bIc(x)  VxeX.  (2.47) 

where  C = {x  : V (x)  < d}  and  d is  any  number  larger  than  — 1.  Suppose 

further  that  $ satisfies  the  following  minorization  condition.  For  some  probability 
measure  Q(-)  and  some  £ > 0, 


P(av)  > eQ{-)  VxeC. 


Let  Xo  = x0  and  define  some  constants  as  follows 


p — A + 


C = log 


2(1  + d ) 

b 


J = 


.2  \1  ~ P 


+ w(x0) 


2 (pd  — £■)(!  + (/)  + &(  1 + 2d) 
2(1  + d)p 

( log(p)  log(l  -e) 


J,  = 


J 


13  rt  = exp 


V log  Je 


<t> 


1—  £ 

log  Wrt) 
log  (p_1) 


There  are  two  cases.  If  J >1  we  have  for  any  n > 0 and  any  1 < (3  < (3rt 

iip"(»o.  •)  - *oii  < co1  ~(fi  • 

1 - (1  - £)Jf 


When  J <1  the  same  bound  holds  for  any  n > 0 except  that  in  this  case  (3rt  = P 1 ■ 
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Theorem  2.7.5.  (Roberts  and  Tweedie  1999)  Suppose  that  $ satisfies  the  following 
drift  condition.  For  some  function  V : X i— >•  [l,oo),  some  0 < A < 1,  and  some 
b < oo 

E[V{Xi+i) \Xi  = x]<  XV (x)  + blc{x)  V x<EX. 

where  C = {x  : V (x)  < d}  and  d is  any  number  larger  than  ~ 1-  Suppose 

further  that  $ satisfies  the  following  minorization  condition.  For  some  probability 
measure  Q(- ) and  some  e > 0, 

P(x,-)>eQ{-)  VieC. 


Let  Xq  = Xq  and  define  some  constants  as  follows 


b 2(pd  — £)(1  T d)  -T  6(1  -T  2d) 

+ 2(1  + d)  = 2(1  + d)p 

l°g  [I  (A  + mM)]  log  [(1  ~ 

log(p_1)  log(p_1) 

There  are  two  cases.  First  if  J > 1 set 


d = exp 


log  P log(l  ~ g) 

log  J — log(l  — e) 


If  n'  = n — £ > r](  1 — e)/e,  we  have 


|Pn(x0,-)  -tt(-)II  < 1 


$(1  — e) 


(1  + n' /r])(l  + r}/n')n'^d~nl . 


(1  + rj/n')1^  t 

On  the  other  hand,  if  J < 1 set  d = p~l.  If  n!  > r/(l  — e)/[pn  — (1  — e)],  then  we  have 

[l-tf(l-e)]  fl(A  + V(x0))l 
||Pn(x„,-)-)r(-)||  < 1 ;'tL_  . 


The  main  difference  between  the  hypotheses  of  the  theorems  in  this  subsection 
is  the  form  of  the  drift  condition  and  this  difference  seems  to  important  in  practice. 
Specifically,  it  is  our  experience  that  establishing  (2.46)  is  much  easier  that  estab- 
lishing (2.47),  especially  for  Gibbs  samplers.  We  can,  of  course,  use  the  conversion 
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lemma  of  Subsection  2.7.3  given  a drift  condition  of  the  form  (2.46).  Unfortunately, 
it  appears  that  whatever  one  would  gain  by  using  the  results  of  Roberts  and  Tweedie 
(1999)  is  negated  somewhere  in  the  conversion.  This  is  why  we  did  not  use  their 
results  in  our  examples  in  Subsections  2.1.3  and  2.6.3. 


CHAPTER  3 

ESTABLISHING  DRIFT  AND  MEMORIZATION  CONDITIONS 

3.1  Introduction 

In  the  previous  chapter  we  focused  on  the  connections  between  geometric  ergod- 
icity  and  obtaining  rigorous  answers  to  (Ql)  and  (Q2).  Here  we  are  interested  in  how 
to  go  about  establishing  geometric  ergodicity.  This  means  that  we  spend  most  of  this 
chapter  verifying  drift  and  minorization  conditions. 

Most  of  the  research  in  the  area  of  convergence  rate  bounds  has  rightly  focused 
on  establishing  theoretical  results  as  in  Theorems  2.3.1  and  2.4.1.  But  little  work  has 
been  done  about  actually  applying  these  results  in  realistic  settings.  Indeed,  prior  to 
the  work  in  this  chapter,  perhaps  the  most  complicated  model  for  which  analytical 
bounds  have  been  established  for  a Gibbs  sampler  on  a continuous  state  space  is 
contained  in  Rosenthal  (1996).  Specifically  he  considered  a Gibbs  sampler  for  the 
following  conditionally  independent  hierarchical  model.  Suppose  for  i — 1 ,K 
that 


Yi\9i  ~ N(0i,a) 

~ N(/x,A)  (3.1) 

A~IG(6,c)  /(//)  cxl 

where  a,  b and  c are  known  positive  constants.  Since  a is  assumed  known  this  is  not 
a very  useful  (or  realistic)  model.  Moreover,  MCMC  techniques  are  not  required  to 
sample  from  the  associated  posterior.  Indeed,  we  will  develop  a rejection  sampling 
algorithm  that  will  allow  us  to  obtain  iid  samples  from  the  posterior. 
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As  we  noted  earlier,  we  are  aware  of  only  a very  few  substantive  applications  of 
regenerative  simulation  in  the  statistics  literature.  We  believe  that  this  is  at  least 
partly  due  to  the  fact  that  the  required  minorization  condition  is  notoriously  difficult 
to  establish.  We  exploit  the  recipe  of  Mykland  et  al.  (1995)  to  establish  minorization 
conditions  for  several  Gibbs  samplers.  A general  method  for  minorizing  Metropolis- 
Hastings  chains,  also  due  to  Mykland  et  al.  (1995),  is  presented. 

The  rest  of  this  chapter  is  organized  as  follows.  In  Section  3.2  we  present  a 
more  general  version  of  the  hierarchical  random  effects  model  (Ad)  we  originally 
presented  in  Section  2.1.2.  We  also  describe  both  the  block  Gibbs  and  univariate 
Gibbs  samplers  for  sampling  from  this  posterior.  Then  in  Sections  3.3  and  3.4  we 
establish  drift  and  minorization  conditions  for  the  block  Gibbs  and  univariate  Gibbs 
samplers,  respectively.  In  Section  3.5  we  consider  how  to  establish  the  minorization 
required  to  implement  RS  in  general  hierarchical  linear  mixed  models.  Next  we 
consider  two  data  augmentation  settings  in  Section  3.6.  Section  3.7  gives  a method  for 
constructing  minorization  conditions  in  Metropolis-Hastings  samplers.  In  Section  3.8 
we  introduce  a class  of  random  effects  models  and  consider  two  versions  of  the  Markov 
chain  Monte  Carlo  EM  algorithm  that  are  useful  for  maximizing  the  likelihoods  of 
these  models.  Finally,  in  Section  3.9  we  show  how  one  might  obtain  iid  draws  from 
the  posterior  associated  with  the  model  in  (3.1). 

3.2  A Hierarchical  Random  Effects  Model 
Consider  the  following  Bayesian  version  of  the  usual  Normal  theory  one-way  ran- 
dom effects  model.  First,  conditional  on  6 = (#i, . . .,0k)t  and  Ae,  the  data,  Y^,  are 
independent  with 

^IMe-N^A;1) 
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where  i = and  j = 1, . . . , raj.  At  the  second  stage,  conditional  on  y and  A e, 

0 and  Ae  are  independent  with 

0\y,  Xe  ~ N(/itl,  Ag  1I)  and  Ae  ~ Gamma(a2,  b2), 

where  1 is  a K x 1 column  vector  of  ones,  I is  a K x K identity  matrix,  and  a2 
and  b2  are  known  positive  constants.  (We  say  W ~ Gamma(a,  /3)  if  its  density  is 
proportional  to  wa~le~wPI(w  > 0).)  Finally,  at  the  third  stage,  y and  A e are  assumed 
independent  with 

y ~ N(/r0,  A^1)  and  A e rsj  Gamma(ai,  hi) 

where  /Uo,  Ao,  ai,  hi  are  known  constants;  all  but  /io  are  assumed  to  be  strictly  positive 
so  that  all  of  the  priors  are  proper.  Also,  we  assume  that  K > 3 and  that  ra*  > 2 for 
all  i. 

The  posterior  density  is  characterized  by 

tt(0,  /i,  Ae,  X6\y)  oc  f(y\0,  Xe)f(9\y,  Xe)  f (Xe)  f (y)  f (Xe)  (3.2) 

where  y is  a vector  containing  all  of  the  data,  and  / denotes  a generic  density.  The 
integrals  required  for  inferences  through  this  posterior  are  not  available  in  closed  form. 
Thus,  we  must  resort  to  (possibly)  high  dimensional  numerical  integration,  analytical 
approximations  or  MCMC  techniques  like  the  Gibbs  sampler. 

In  their  seminal  paper  on  the  Gibbs  sampler,  Gelfand  and  Smith  (1990)  used  the 
balanced  version  of  this  model  (in  which  raj  = m)  as  an  example.  These  authors 
described  the  standard  Gibbs  sampler  in  which  a cycle  consists  of  univariate  updates 
of  each  of  the  K-\-  3 parameters.  (See  also  Gelfand  et  al.,  1990  and  Rosenthal,  1995b.) 
The  univariate  conditional  densities  that  are  required  to  implement  the  Gibbs  sampler 
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are  as  follows 


\o\\e,  /x,  0 , ~ Gamma  ^y  + oi,  h + ^ - nY 

Ae|A0,  y,  6 ~ Gamma  ^y  + ^ ~ 0*)2^ 


l\  x f)  m ( ^o/i°  + 1 

fi\\e,\e,0  ~ N(^  Aq  + ^A,  ’ A0  + i^Afl 


0*  I A#,  Ae,  /x,  0 - 


N 


Afl/x  + mi\eyi 


Xg  + niiXe  X g + ?7XjA, 
where  0_j  - (0i, . . . , 0j_i,  0*+i, . . . , 0K)T,  M = Himi’  ^ 


* = !>•■•»  K 

K~l  Hi  and  Vi 


Hobert  and  Geyer  (1998)  also  introduced  the  more  efficient  block  Gibbs  sampler 
in  which  all  of  the  Normal  components  are  updated  simultaneously.  (Recall  that 
the  balanced  data  version  of  this  block  Gibbs  sampler  was  our  main  example  in 
Chapter  2.)  More  specifically,  let  £ = (0i, . . . , 9K , y)T  and  A = (A g,  Xe)T  and  define 
K k 

Mi)  = 5^(0i  - y)2  and  v2{i)  = ~ Vi)2  and  SSE  = ~ ^ ' 

i=l  i=l  i,j 

The  full  conditionals  for  the  variance  components  can  now  be  written  as 


Xg\£,  Ae,  y ~ Gamma 


K 

y + ax, 


Mi) 

2 


+ bi 


and 


Ae|£,  Xg,  y ~ Gamma 


( M Mi)  + SSE 

I — + a2i  ^ k ^2 


\2  ' 2 

Hobert  and  Geyer  (1998,  p.418)  show  that  £|A ,y  ~ N(£*,V)  and  give  the  specific 
forms  of  i*  = £*(A,  y)  and  V = V(A,  y).  Because  we  will  make  extensive  use  of 
them  they  are  restated  here.  First,  we  let 

K 


* = E 


2=1 


THiXgXe 

X o + rriiXe 
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then 


Finally, 


and 


Var(0i|Afl,  Ae)  = 
Cov(0j,  9j\\e,  Ae)  = 
Cov(0j,  n\\o,  Ae)  = 
Var(^|Ae,  Ae)  = 


A e + rrii\e 


1 + 


(Ae  + mjAe)(Ao  + i)_ 


A2 


(A  g + nriiXe)(Xe  + rrij  Ae)(Ao  + t) 
Ae 


(A#  + ra,Ae)(Ao  + t ) 

1 

Aq  + 1 


E(li\ A) 


1 

Aq  + t 


mjXgXeyj 
Xg  + mjXe 


+ ^oAo 


(3.3) 


(3.4) 


£(0<|A)  = 


Xg 

Xg  + miXe 


An  + t 


K 


ETTljXgXey  a 

- — , — ;■  + moAq 


i= i 


Ae 


rrijAe 


+ 


A efflil/j 

Ae  + mjAe 


Observe  that  the  expression  in  (3.4)  is  a convex  combination  of  {yt}  and  /r0.  Note 
that  E(0i\X)  is  a convex  combination  of  E(/jl\ A)  and  y{.  Let  A denote  the  length  of  the 
convex  hull  of  the  set  {yi,  f/2, . . . , ?/*-,  ^0}-  It  now  follows  that,  for  any  i = 1, 2, . . . , K, 
[£(0*|A)  - E(n |A)]2  < A2  and  [J5(0<| A)  - &]2  < A2. 

When  the  data  are  balanced,  i.e.,  m*  = m,  we  can  obtain  a more  useful  bound  on 
[E(6i\X)  — E(y | A)]2  and  [E(9i\X)  — yi]2.  In  this  case, 


MXgXe 
Xg  + mAe 


so  that 

ty  + ^qAq 
Aq  + t 
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where  y is  the  overall  mean  of  the  data.  Hence  for  alii  = 1, . . . , K we  have 


WI-M-p,]2  = 


{ ty  + ) , A.mp, 

Ac 

1 


Ag  + mAe  ^ Ao  + t ) Ag  + vnAe 
Ag 


A q + T7l\e 


An  + t 


+ /h)Ao  — (Aq  + t)yj\ 


< 


tjy-Vi ) + Aq(^q  - yt) 
An  + t 


1 2 


which  is  a convex  combination  of  /i0  — ?/;  and  y — y It  follows  that 


[E(0i | A)  - yj2  < [(fio  - y{)  - {y  - y^)]2  = (mo  - yf 


A similar  argument  shows  that  when  the  data  are  balanced 


[E(di\X)  — E(y\X)]2  < (y,0  — yf 


for  any  i = 1,  2, . . . , K. 

We  are  now  in  a position  to  establish  the  drift  and  minorization  conditions  required 
for  geometric  ergodicity.  In  Section  3.3  we  do  this  for  the  block  Gibbs  sampler  while 
the  calculations  for  the  Gibbs  sampler  are  deferred  until  Section  3.4. 


3.3  Drift  and  Minorization  for  the  Block  Gibbs  Sampler 
In  this  section  we  establish  drift  and  minorization  conditions  for  the  block  Gibbs 
sampler.  In  Subsection  3.3.1  we  establish  a drift  condition  for  the  balanced  case, 
i.e.,  rrii  = m for  alii  = while  in  Subsection  3.3.2  we  consider  the  more 

general,  i.e.,  possibly  unbalanced,  case.  In  Subsections  3.3.3  and  3.3.4  we  consider 
two  different  ways  of  establishing  the  required  minorization  condition. 


3.3.1  Drift:  Balanced  Case 

First  we  establish  a drift  condition  of  the  form  (2.41).  Define  two  constants  as 


8i  = 


1 

2gi  + K — 2 


and  82 


1 

2(i2  4"  Af  — 2 


follows 
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Also,  let  S3  = K 82  and  64  = (K  + 1)  62 . It  follows  from  our  assumptions  about  K and 
M = mK,  that  0 < 5*  < 1 for  i = 1,2,3, 4.  Let  <5  = max{5i,54}.  Choose  A G (5,1) 
and  then  choose  (f>  > 0 such  that  <pS3  4-  6 < A.  Define  the  energy  function  as 

V(0  = <M(0  +m-1v2{£) 

where  iq(£)  and  v2(£)  are  as  defined  in  Section  3.2.  We  will  show  that 

£[U(£)|A',  £']  < cf>S1v1(£')  + (</>S3  + S4)m-1v2((')  + b (3.5) 


for  a constant  b < 00  and  this  implies 

£[U(0|A',£']  < tSvxi?)  + (<A53  + 5)m-1«2(^)  + b < \V(£)  + b.  (3.6) 

(Once  again,  we  are  writing  the  one-step  transition  as  (A\  £')  — >■  (A,£)  instead  of 
(A„,£n)  -*  (An+i,£n+1)  in  order  to  avoid  excessive  subscripting.) 

We  will  calculate  the  required  expectations  using  the  following  rule: 

£[V«)|A',<1  = £[V«)I«1  = -B{J5[K(«)|A]|«'}.  (3.7) 


We  begin  with  some  preliminary  calculations.  First,  note  that 

2&i  , «i(£') 


E ^ ^ 2ai  + K - 2 + 2ai  + K - 2 


and 


^(Ae-1|0  = ^2+^  + 


«*(*') 


= ^ + 5^(0  (3.8) 


= C2  + WO-  (3-9) 


2a2  -(-  M — 2 2a2  -h  M — 2 

We  now  begin  the  main  calculation.  Using  our  rule,  we  have 

K ( K 

£M0I«1  = I> [(«<-#< W]  =e  (£>[(<>,- **)2|a]  |«' j.  ■ 


1=1 


1=1 
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Using  the  results  stated  in  Section  3.2,  we  have 


E [(0j  - n)2\X)  = Var(0j|A)  + Var(/x|A)  - 2Co v[(0j,  /x)|A]  + [£?(0<|A)  - E{ji\ A)]2 


1 A$  + (A#  + mXe)2  — 2Xg(Xg  + m\e) 

Xg  + mXe  (Aq  + t)(Xg  + mXe)2 


+[E(6i\\)  — E(n\X)f 


+ 


m2  A2 


Xg  + mXe  (Ao  + t)(Xg  + mXe)2 

1 17l\e  , ^2 

- m\c  + i(A»  + mAlj  + “ f‘o) 


+ [B(«j|A)  - £(/j|A)]! 


Hence, 

X £ [(ft  - u)2|A]  < ^A;1  + Ag 1 + A-(S  - /*,)2  ■ (3.10) 

i= 1 

Combining  (3.10)  with  (3.8)  and  (3.9)  yields 
E [(/>Vi(£)\£']  < 8i<J>Vi(£') S^m  ^(O  + </*[Ci  + Km  1C<2  + K(y  — //o)2]  • (3-11) 


Now 

£ [m-S(£)|t']  = X B [(ft  - Pi)2!?']  =^{X£[W-  ^)2IA1  l«' 

i l i 

Again  using  results  from  Section  3.2,  we  have 


E[{9i-yt)2  |A]  = Var(0,|A)  + [E(0,|A) 

1 


Vi? 


< 


Xg  + mXt 
1 


+ 


\2 

A0 


+ [B(ft|A)  - y,? 


+ 


(Ao  + t)(Xg  + mXe)2 

A<? 


mXe  t(Xg  + mAe) 


+ (y  - Mo)' 


Hence, 

X E [(ft  - Vi)2|A]  < l^ihA,-1  + K(y  - /jo)2 

i=l 

Then  combining  (3.12)  with  (3.9)  yields 


£[m  *«2(£)|£']  <S4m  V2(0  + 


(B  + l) 


C2  + K(y  - /<„)2  . 


(3.12) 


(3.13) 


m 
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Finally,  combining  (3.11)  and  (3.13)  gives  (3.5)  with 


b = <\>CX  + [( (t>K  + K + 1 )/m\  C2  + K{<f>  + 1 )(y  - Mo)2  • 


3.3.2  Drift:  General  Case 

Define  another  constant  S5  = d2  m^1  and  note  that  0 < S5  < 1.  Recall  that 
5 = max{5l5  <54}.  Choose  A £ (£,  1)  and  0i  > 0 and  02  > 0 such  that 


01^5  + 02 <5 


< A. 


Now  define  the  energy  function  to  be 


— ^ lwl(£)  + 02U2(O-  (3-14) 


The  energy  function  in  the  previous  subsection  is  easily  seen  to  be  a special  case  by 
taking  02  = 1/m.  We  will  show  that 

■®[y(Ol^f)  5:  01^1  vl(£)  + (01^5  + 02<^4)^2(^)  + b (3.15) 


and  this  implies 


£7[V(^)|A',  i']  < 0i^i (O  + 702«2(^)  + b<\V(Z)  + b 


As  in  the  previous  subsection  we  will  calculate  the  required  expectations  using 
(3.7).  Note  that  (3.8)  and  (3.9)  still  hold. 

We  now  begin  the  main  calculation.  Using  our  rule,  we  have 

BMOie']  = £ E [w  - m)2I€']  = b If; b [(ft  - ^)2|A]  \C 


i= 1 


i=l 


86 


Using  results  from  Section  3.2,  we  have 


E [(9i  - yu)2|A]  = Var(0j|A)  + Var(^|A)  - 2Cov[(0i,  /i)|A]  + [E(9i |A)  - E^\X)f 
_ 1 Ag  + (A#  + mi\e)2  — 2Xg(Xg  + mjAe) 


Xg  + rriiXe  (A0  + t)(Xg  + mj-Ae)2 

+[E(9i\X)-E^\X)}2 
1 


™?A2 


Xg  + m,Ae  (Ao  + t)(Xg  + THiXe )2 


+ [£(0i|A)  -£(/z|A)]2 


1 rriiXP 

< — - + 


rriiXe  t(Xg  + m,Ae) 


+ A2  . 


Hence, 


K 


K 


E [(#i  - y)2|A]  < A,-1  y m-1  + A,-1  + A- A2  . 


(3.16) 


2—1 


2=1 


Thus,  by  combining  (3.8),  (3.9)  and  (3.16),  we  obtain 


(^)  < ^l<Alul(0  + ^501^2(^0  + </>l 

Now 


K 


+ +K  A2 


2=1 


(3.17) 


E M£)|«']  = J>i-E  [(«.  - Si)2|£']  = E miE  [(ft  - y,)2|A]  |£'|  . 


We  can  bound  the  innermost  expectation  as  follows. 


Em-ViY |A]  - Var(0j|A)  + [E(9i\X)  — yif 

1 


+ 


Xg  + niiXe  (Ao  + t)(Xg  + mjAe)2 


+ [E(9i\X)-yif 


< 


1 


+ 


Xg 


rriiXe  t(Xe  + rriiXe ) 


+ A2 


Hence 


K 


X>i£[(0i  - Si)2|A]  < (K  + l)Ae->  + 11A! 


(3.18) 


2=1 


and  so  by  combining  (3.9)  and  (3.18)  we  obtain 


E[4>2V2{£)  < ^402^2(^0  + 4> 2 \(K  + 1)C*2  + M A2]  . 


(3.19) 
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Combining  (3.17)  and  (3.19)  yields  (3.15)  with 


b — + C*2  ^ ) 77 ii  1 + fa(K  + 1) J + faM)  A2. 

We  will  establish  two  different  minorization  conditions  in  the  following  two  subsec- 
tions. The  first  is  based  on  Rosenthal’s  (1995a)  Lemma  6b  while  the  second  is  based 
on  a technique  described  in  Mykland  et  al.  (1995).  In  both  sections  we  consider  only 
the  minorization  corresponding  to  the  energy  function  in  (3.14). 


3.3.3  Minorization  I 

Fix  d > 0 and  note  that  S = {£  : V(£)  < d}  is  contained  in  Cs  — Cs1  fl  Cs2 
where 


CSl  = {£  ■ «i(£)  < d/ fa}  and  CS2  = {£  : v2{£)  < d/fa}  - 


Hence,  it  suffices  to  establish  a minorization  condition  that  holds  on  Cs-  We  know 


that  (2.14)  is  satisfied  if  we  can  find  a density  q( A,  £)  on  R2  x R^+1  and  an  e > 0 


such  that 


fr(A*|£,,y)7r(Ae|^y)7r(£|A,y)  > eg(A,£) 


for  all  £'  € Cs-  Now  if  £'  £ Cs,  we  have 


7r(Ae|^,,t/)7r(Ae|^/,y)7r(^|A,y)  > 7r(£|A,y)  inf  [7r(Afl|^,y)7r(Ae|^,y)] 


> tt(€|A,  y) 

inf  7r(A*|£,y) 

5 

inf  7r(Ae|£,y) 

> tt(^|A ,y) 

inf  7r(Ae|£,y) 

inf  7r(Ae| £,y) 

_teCs2 

The  density  we  use  is 


q(\£)  « ?r(£l \y) 


inf  7r(A0|^,y) 


inf  7r(Ae|4,y) 


(3.20) 


We  can  find  the  required  infima  with  the  help  of  the  following  lemma. 
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Lemma  3.3.1.  If  a > 1,  a > 0,  and  c > 0 are  fixed  then 

Gamma(a,  a ; x)  if  x < x* 

Gamma(a,  a + c/2;  x)  if  x > x* 

where 

x*  = ^1°g(1  + ^). 

Proof.  Let 


inf  Gamma(a,a  + ft/ 2\x)  = < 

0<(3<c 


fc{x) 

//?(*) 

fo(x) 


(a  + c/2)  l ^—x(a+c/2) 

r(«) 

(a  + p/2)a  ! x(a+0i2) 

r(a) 


r (a) 


x 


a—l„—xa 


There  is  exactly  one  solution  to  fc(x)  = /o(^)  and  it  is 

*'  = Tlog(1  + s)' 

It  suffices  to  show  that  (i)  Rq(/3 ) = fp(x)/fo(x)  > 1 for  all  x G (0,:r*)  and  all 
P G (0,  c);  and  (ii)  RC(P)  = fp(x)/fc(x)  > 1 for  all  x G (x*,  oo)  and  all  P G (0,  c). 

Before  proving  (i)  and  (ii),  we  establish  a preliminary  fact.  Let  k > 0 and  define 
a function 

h{u)  = TXTu  ~ log(1  + ku) 

for  u > 0.  Since  h( 0)  = 0 and  h'(u)  < 0,  we  know  h(u)  < 0 for  u > 0.  Hence, 


1 k 


u 1 + ku 

for  u > 0.  Define  another  function 


ir 


log(l  + ku)  < 0 


g(u)  = — log(l  + ku) 
u 


(3.21) 
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for  u > 0.  Since  the  the  left-hand  side  of  (3.21)  is  equal  to  g'(u),  we  have  established 
that  g(u)  is  decreasing  for  u > 0. 

Now  back  to  (i).  Since  x < x*,  we  have 

log  R(0)  = a log  + 

Thus,  all  we  need  to  show  is  that 

Qlog(1  + £)-Tlog(1  + ^)  >0 

whenever  f3  G (0,c),  but  this  follows  from  the  preliminary  fact  above.  Case  (ii)  is 
similar.  □ 


Applying  Lemma  3.3.1  to  the  infima  in  (3.20),  we  have 


where 


for 


and 


for 


<?(A,0  = 


hi(\e) 


L/R+fc,(A,)dA, 


h 2(Ae) 


Jr+  ^2  ( Ae)  d\e 


hx(\e)  = 


tt(€I  \y) 


Xg  < Xg 


Gamma  ( y + ax , b\ ; A#) 

Gamma  (f  + ai,  ^7  + 61;  A<?)  Xg  > X*g 


x.  = Mk±M1oJ1+ 


d 


26i  (f>i 


j Gamma  + a2,  + b2;  Xe^j  Xe  < A; 

Gamma  y + d2)  T b2]  Ae^  Ae  > A 


A,  = fe(M+2a2_)lQ8[1  + 


d 


<^2(262  + SSE)  / 


(3.22) 


Finally,  e - fR  hi(Xg)  dXg  fR+h2(Xe)dXe  and  each  of  these  integrals  can  be 
easily  computed  with  two  calls  to  the  incomplete  gamma  function. 
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3.3.4  Minorization  II 

Fix  a “distinguished  point”  £ = (6i, . . . ,6k,  fj)T  £ M^+1.  Choose  d\,  d2,  d3,  d4 
such  that  0 < d\  < d2  < oo  and  0 < d3  < d4  < oo.  Put  D = [di,  d2]  x [d3,  d4]  C M+. 
Then  we  have 


*■(*!£',  yM£|A,y)  = 


> 


.tt(A|  £,y) 

inf^AIS'.y) 


*■(*!£,  yM£|A,y) 

7r(A|^,y)7r(^|A,y)/(A  6 £>) 


-AeD  7r(A|£,y). 

Now  let  g(A,  £)  be  a density  with  support  D x RK+1  defined  as 


9(A,«)  = br(A|£®M«|A,|/)/(AeI>) 
c 


where  c — fD  7r(A|£,  y)  dX.  (Note  that  c can  be  computed  with  four  calls  to  the 
incomplete  gamma  function.)  Also,  let 


s(0  = c inf 


7r(A|£',y) 


XeD  7t(A|£,  y) 

Putting  all  of  this  together,  we  have  for  all  (A,  £)  and  for  all  (A',£') 


7r(A|^,y)7r(^|A,y)  > s(£')q(X,£)  . 


Now  let 


e = inf  s(£')  = c inf 
£'es  £'es 

Then  for  all  £'  G S',  we  have 


infh^lA). 

AG D 7r(A|^,y) 


tt(A|£',  y)  tt(£| A,  y)  > e q( A,  0 , 


which  is  the  minorization  we  need. 
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We  now  describe  how  to  calculate  e.  For  notational  convenience,  let  v\  = v,(£') 
and  Vi  = Vi(£)  for  i = 1,  2.  First 

,«')  = cinf  1r(A|*'j/)- 
XeD  tt(A| l,y) 

Gamma  (y  + ax,  b\  + \v[]  Ag)  Gamma  (y  + a2,  b2  + ^v2;  Ae) 

Aen  Gamma  (y  + ax,  6X  + |ui;  Ag)  Gamma  (y  + a2, &2  + 5^2;  Ae) 

where  b2  = b2  + SSE/2.  Now  straightforward  calculations  reveal  that 

f +02 


*(£')  = c 


2 b\  + 

f+ai 

2b2  + v2 

2 b\  + V\ 

2b2  + v2  _ 

x 


exp  1^!  - vi)  [d2  + (di  - d2)I(vi  > ^)]  + ^(ti2  - v'2)  [dA  + (d3  - d4)I(v2  > v2)]|  • 

Note  that  s(^,)  involves  £'  only  through  v[  and  v2.  Let  e*  be  the  infimum  of  the 
right-hand  side  of  the  above  equation  on  the  set 


S*  = {(v[,v2)  : v[  > 0 , v2  > 0 , (pi  v[  4-  <f>2v 2 < d}  . 

Now,  since  there  are  some  ( v[,v'2 ) pairs  in  S*  for  which  there  is  no  corresponding  £', 
it  follows  that  e*  < e.  In  practice,  however,  e*  is  much  easier  to  find.  For  example, 
one  could  do  a grid  search  over  the  relevant  part  of  the  (v[,v2)  plane. 


3.4  Drift  and  Minorization  for  the  Gibbs  Sampler 
In  this  section  we  will  not  distinguish  between  the  balanced  data  case  and  the  more 
general  setting.  However,  we  will  assume  that  5 m!  > m"  where  m!  = min{mi, . . . ra#} 
and  m"  = max{mi, . . . , mK}.  That  is,  while  the  data  may  be  unbalanced  it  may  not 
be  overly  so.  We  will  also  assume  that  ax  > 1 + (2 z)-1  for  an  arbitrary  2 € (0, 1). 
Hobert  and  Geyer  (1998)  were  able  to  establish  geometric  ergodicity  only  when  m!  > 
{y/h  — 2 )m"  and  ax  > (3 K — 2) / (2 K — 2).  Note  that  if  we  fix  a value  of  K we  can 
choose  a z close  enough  to  1 so  that  (3 K — 2) / (2 K — 2)  > 1 + (2 z)-1.  Because 
\/5  — 2 > 1/5  our  constraint  on  m!  and  m"  is  also  (slightly)  weaker. 


92 


3.4.1  Drift 

We  will  begin  by  defining  some  constants  that  will  be  required  in  the  following 
argument.  Let 


V = 


K 2 + 2Kai 


< 1 and  7 


z ( 2d  i + K — 2) 


2A06i  + K2  + 2 Kax  K 

Recall  from  the  previous  section  that  <5i  = (2ai  + K — 2)-1.  Now  observe  that 


K8l  + ^-<\ 
7 K 


since 


1 - KSi 


JL 

7 K 


> 


K 

2ai  + K — 2 
K 

2 0,1  + K — 2 
2z(a1  - 1)  - 1 
z(2ai  + K — 2) 


> 0 


rj  K 
K z(2ai  + K -2) 
1 

z(2ci\  -(-  K — 2) 


because  of  the  constraint  on  07.  Thus  there  exists  0 < pi  < 1 such  that 


KSl  + K~  = (K  + z)  Sl  < Pl  < 1 ' 

An  application  of  the  following  lemma  will  show  that  as  long  as  5 m'  > m",  \g  > 0 
and  Ae  > 0 then 


m"  XP 


A g + m"  Xt 


+ 


A g 


A g + m'\e 


< 1 


and  hence  there  exists  0 < P2  < 1 such  that  if  8 — max{p,  z}  then 


rn"  X,, 


A g + m"  Ae 


+ 


Xg 


X g + m'Xe 


< P2  < 1 


Also,  we  let  A = max{p1,p2}. 
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Lemma  3.4.1.  Suppose  X > 0,  Y > 0 with  at  least  one  of  X or  Y strictly  positive. 
Also,  suppose  a > b > 0 are  constants  such  that  5 b > a.  Then 


( aX  \2  / Y \2 

+ W +\bX  + Y) 


< 1 . 


(3.23) 


Proof.  We  begin  with  a preliminary  matter.  Let  g(v)  = v + cv~x  where  c > 0 and 
v > 0.  Then  g'(v ) = 1 — cv~ 2 so  that  the  only  critical  point  is  v — \fc.  Now 
g"[v)  = 2cv~3  > 0 and  hence  v is  a minimum.  Now  consider  (3.23). 

\aX  + Yj  \bX  + Yj 
(aX  + Y)2  - a2X2  Y2 
( aX  + Y )2  (bX  + Y)2 

Y2  + 2 aXY  Y2 

(aX  + Y)2  ~ (bX  + Y)2 
{bX  + Y)2(Y2  + 2 aXY)  - Y2(aX  + Y)2 
(aX  + Y)2(bX  + Y)2 

(b2X2  + 2 bXY  + Y2){Y2  + 2 aXY)  - Y2(a2X2  + 2 aXY  + Y2) 

( aX  + Y)2{bX  + Y)2 
2 ab2X3Y  + 2 bXY3  + (62  + 4a6  - a2)X2Y2 
( aX  + Y)2{bX  + Y)2 

2bX2Y2[ab(X/Y)  + {Y/X)\  + ( b 2 + 4 ab  - a2)X2Y2 
{aX  + Y)2{bX  + Y)2 

X2Y2 [4a62 (1  / Vab)  + 2 bVd>\  + ( b 2 + iab  - a2)X2Y2 
~ ( aX  + Y)2{bX  + Y)2 

X2Y2(b2  + 4 ab  + ibVab  - a2) 

(aX  + Y)2(bX  + Y)2 
X2Y2(5b2 + 4ab-a2) 

~ {aX  + Y)2{bX  + Y)2 

X2Y2[5(b  + 2a/5)2  — 9a2/5]  . 

(aX  + Y)2{bX  + Y)2  K ' ’ 

Now  as  long  as  5b  > a the  numerator  in  (3.24)  will  be  positive.  □ 


Define  the  energy  function  as 


V (/x,  6 , A)  — wi(Xe)  + W2(\e)  + 'ywz(Xg)  + w^(0,  A) 
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where  Wi(Xg)  = ecXe , W2{Xe)  — ecAe,  w3(Xe)  = Xg1  and 


with  0 < c < min{6l5  ft2}5  9 = @i  and  y = K~l  J2i  Vi-  We  will  show  that 

m"X' 


£[U(/z,0,A)|//,0',A']  < (k+^S^w 3(A'e)+(5 


A + m"  X' 


X'a 


K + rn'X'e 


w4  (0,  X)  + b 


< PiTW3(X'e)  + p2ic4(0',  A')  + b 

< xv(p',e',\')  + b 


(3.25) 


where 
b = 


h 


bi  — c 
+ V 


cti+K/2 


+ 


b2  \ “2+iV/2  Z 


&2  - C 


+ -[2b1  + K 


Af 


+ (a*o  - y? 


+ s4 


with  s2  — Y^ZiiVi  ~ V )2-  Recall  that  we  are  considering  the  Gibbs  sampler  with  the 
following  sampling  scheme:  (p',91,  A')  ->  (p,  0,  A).  This  is  the  same  update  order 
used  by  Hobert  and  Geyer  (1998).  We  will  calculate  the  required  expectations  via 
the  following  rule: 

E[V(p,0,\)\p',G',\'}  = E[V(p,0,  A)|0',A'] 

= E{E{E[V(p,  0,  A) \p,  0 , 0',  \']p,  O' , A'}|6>',  A'} 


We  now  begin  the  main  calculation.  Using  the  above  rule  we  obtain 

ai+K/2  y . x ai+K/2 


E[wi(Xe)\p,9} 


^ + lEUQj  - p)2 
h + ~p)2  ~cj 


< 


h 


b\  — c 


(3.26) 
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and 


E[w2(\e)\0,n\  = 


+ 1 Eij(vv  - «.)2  N-+W 


fe2  + I - ei)2  - C, 


< 


bo  - c 


Now  consider  tc3.  Then 

E[w3(Xe)\n,  6}  = <5i 

and 


K 


2bi  + — l*Y 


i=l 


(3.27) 


E[(Qi  — n)2\n,  A']  = Var(6i\n,  A')  + [E(9i\n,  A')  — //)] 


1 


X!g  + mriiX'e 


+ 


X'gfi  + miX'eyi 

X'e  + rriiX'e 


2 

n 2 


- 


+ 


< 


A'0  + mjA'e  \ X'9  + miX'e  J 

1 / m"X^ 

X'e  + m'X'e  + \ A'e  + m"Ag 


(/*  - ViY 

(/*  - y*)2 


so  that 


2=1 
2 K 


* KwM)+(jYt^k)  i>-p+f-R) 


5 + ( " 5)2  + s2 ' 


Let  0'  = Then 


E{{n-W 


=n2 


Var(#x|0',A,)  + [£?(/i|0',A')-y] 

1 


A0  + K A' 
1 


+ 


< 

< 


Ao  + i^Ag  _Ao  + TtTAg 
1 A0 


AoMo  + KX'gd'  = 
Ao  + #A'  “ y 

Ao 


n 2 


Ao  + ifA;' 


Ao  + i^Ag  Ao  + KX  g 
1 


7(»-5)2  + t^w(«'-5)2 


Ao  + A'Ai 


Ai 


+ (/i0  -y)2  + w4(0',A') 
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where  the  first  inequality  is  the  convexity  inequality  [AX+(1— A)F]2  < AX2+(1— A)T2 
for  0 < A < 1.  Hence 

E^w^XeW , 0',  A']  < KSi1w3(X'i)  + z ( ,,  i»4(0',  A')  + 63  (3.28) 


where 


h = -[2b1+K 


-r-  + (/h)  - V? 

AO 


+ S' 


Now  we  turn  our  attention  to  uq.  First  note  that  by  Jensen’s  inequality 

e(.  kx!„  |„,<A  < 


Ao  + KXg 


A0  + KE(Xg\iJL,  9) 

K 2 + 2Kax 


K 2 + 2 Kax  + A0(26!  + £^(0*  - m)2) 
K2  + 2 Kax 


< 


2A0fti  + K2  + 2Ka\ 

= V 


As  in  Hobert  and  Geyer  (1998)  observe  that  the  conditional  independence  of  the  s 
implies 


K K 

7)1  \ tvt  I 1 + mi^eyi  1 


1 


K A<?  + rriiXe  ’ A'2  “ A^  + mi\e 
, 1=1  2=1  / 

and  hence  by  using  Jensen’s  inequality  again  we  obtain 


£[(0-j/)V,A']  = Var(0|ftA') +[£(%,  A')- F] 

1 K 1 [ 1 K 
lo  ^ A'  + m^Ag  + ^^At 


< 


i=l 


1 1 


A() 


K X'ft  + rriiX' 

i=i  s e 


(a*  - y<) 


1 f / 

/ y + ^ Z->  I 


A'fl 


/f  A^  + ra'Ag  if  \A^  + m,i 


iXL 


(m  - y*)5 


< kMX,)+  {ye  + m‘Xe, 


(V-W  + -P- 


K 


Therefore,  reusing  the  calculation  for  E[(n  — y)2\0',  A']  , we  get 


£>4(0,  A)|/.',  0',  A']  < lw3(Xt)  + q ( A, 


uq(0',  A')  + b4 


(3.29) 
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where 


b4  = V t-  + (A*o  - yf  + 

Ao  ti 

Combining  (3.26),  (3.27),  (3.28)  and  (3.29)  will  now  yield  (3.25). 


3.4.2  Minorization 

At  this  point  let  d > 0 be  arbitrary  and  define  S — {(/x,  0,  A)  : V(fi,  6 , A)  < d}. 
Similar  to  our  previous  work  with  the  block  Gibbs  sampler  our  goal  is  to  find  a density 
q(n,  6 , A)  on  K x RK  x R^  and  an  e > 0 such  that 

7r(/x|0',  A')7t(0|^,  A')7t(A|/x,  6)  > eg(/x,  6 , A) 


for  all  {O' , A')  G S.  Note  that  by  the  convexity  inequality 


1(0,  A) 


( Ao^o  + KXgd  Ao  ( 

= ( Ao  + ATA.  -y)  -X^KX,^ 

< (/xo-^)2  + ^4(6>,A) 


r+ 


K Xe 

Aq  + KXg 


(0  - V)2 


so  that 


5 = {(/x,0,A):V(/x,0,A)<d} 

C {(/X,  0,  A)  : Wi(Ae)  < d,  w2{ Xe)  < d,  w3{ X9)  < d/7,  u >4(0,  A)  < d} 

= {(/x,  0 , A)  : ecAs  < d,  ecXe  < d,  l/Xg  < d/7,  (/x0  - ^)2+ 
tt;4(0,A)  < (/x0  — p)2  + d} 

C {(/x,  0,  A)  : 7/d  < < c_1  log(d),  0 < Ae  < c-1log(d), 

1(0,  X)  < (iMi  -f)2  + d} 

= { (/x,  0,  A)  : 7/d  < A#  < c_1  log(d) } n { (/x,  0,  A)  : 0 < Ae  < c-1  log(d) } D 
{(/x,  0,  A)  : C\  < (Ao^xo  + K Xg6) / (Xo  + ATAg)  < C2} 

= Csj  n Cs2  n C53 
= C5 
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where  c\  — y — yj (/x0  — y)2  + d and  c2  = y + yj o - y )2  + d.  Note  that  Cs1  fl  Cs2 
is  nonempty  as  long  as  d is  sufficiently  large,  i.e. , d > 1 and  7 c < d\og(d).  Hence  it 
suffices  to  establish  a minorization  on  Cs ■ Now  as  long  as  ( G A')  G Cs 


7r(/i|0',  X')n(G\n,  X')n(X\n,  G ) 


> 7t(A| fi,G)  inf  [k(h\G',  \')tt(G\ij,,  A')] 
(e',\')ecs 


inf  7r(0|/x,  A') 
X'  £Cs1<~iCs2 


inf  n(fi\G',X') 
(0\\')ecs 


> n(X\fi,G) 

Recall  that  the  0;’s  are  conditionally  independent.  We  begin  with 


K 


inf  7r(0|/i,  A')  = inf  TT  7r(0j |/x,  A'). 

VeCSlnCs2  VeCSlnCs2 


Note  that  by  the  convexity  inequality 


n(0i\n,  A') 


A«  + miX'e  \ 1/2  j Xg  + T7ij A'  , 

1 exp  < r I Bi  - 


2tt 

X'e  + mi\'e\1/2 


Xgfi  + miX'eyi' 

\'e  + mi\'e 


2ir 


x 


exp 


X'e  + mjAg 


A*  (ft  -«)+,,  "“5  „ (ft  - ft) 


n 2 


LAg  + mjA'g 


A'e  + m^Ag 


> 


Ag  + mjAg\1/2 


2tt 


x 


exp  < - 


X'e  + m^Ag 


X'0(9j  - ^)2  + mjX'e(6i  - j/j)2 
L A'e  + mjAg  A'„  + mjAg 


X 0 + miXA  exp  [A'0(0j  - n)2  + mjAg(0j  - y{)2] 


2tt 


and  hence 


a: 


t=i 


= 0i (/a,®)  • 


Now  we  turn  our  attention  to  7r(/i|0',  A').  We  will  require  the  following  obvious 
lemma.  Let  N(r,  a2\w)  denote  a N(r,  cr2)  density  evaluated  at  w. 
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Lemma  3.4.2.  Let  C\  < c2  be  two  constants.  Then 


inf  N(t,  a2\  w)  = < 

ci<r<C2 


N(c2,  ct2;  w ) ifw< 
N(Cl,a2;w ) ifw>^ 


Note  that  for  (0',  A')  G C5 


, lQl  / A0  4-  ( Ao  + i^Ag  ^ A0/i0  + KX'g6' 

7r(/i|0  , A ) = [ 1 ) exp  ^ ^(/u- 


27t  ) r|  2 V*~  Ao  + -R"Ag 

> A 0 + y7/d  Ao  + (JC/c)logWy/2,, 

A„  + (A-/c)log(<i)  27T  J 

A0  + (A'/ c)  log(d)  / Xoiao  + KX'gd1 
exp  < ^ - 


A0  + K A' 


so  that  an  application  of  Lemma  3.4.2  yields 


inf  *(„!«',  A')  > inf  ( A„  + A-7/d  A,  + (^/c)log(d)\ 
',A')€Cs  («',V)€CSs  VA0  + (tf/c)log(d)  2tt  / 

J A0  + (K/c)  log(d)  / A0/i0  + KX'gO' 

exp  S « ( T 


A0  + K A' 


> 1 


where 


Thus  we  take 


and 


A0  + (K/c)  log(d) ) 

N(c2,  [A0  + {K/c)  log(d)]-1;  //)  n<Tj 
N(ci,  [A0  + {K/c)  log(d)]-1;  /j)  /z  > £ . 

q(n,  0,  A)  oc  0)p2(/i)7r(A|/i,  0) 


£=  / 0i (/*,  0)g2(n)dndB 

Jrk  Jr 

which  is  substantially  more  difficult  to  calculate  than  the  e from  the  block  Gibbs 
argument  in  Subsection  3.3.3. 
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3.5  Minorization  in  Hierarchical  Linear  Mixed  Models 
We  consider  Bayesian  versions  of  the  usual  linear  mixed  model  (Searle  et  al. 
1992).  In  this  context  we  establish  the  minorization  condition  required  to  implement 
regenerative  simulation  for  the  associated  block  Gibbs  sampler.  However,  except  for 
the  special  case  of  the  hierarchical  random  effects  model  we  considered  earlier  in 
Section  3.3  the  convergence  rates  of  these  samplers  are  unknown. 

Consider  the  usual  frequentist  general  linear  mixed  model  (Searle  et  al.  1992,  p. 
317), 

Y = X(3  + Zu  + e 

where  Y is  an  n x 1 vector  of  observations,  X is  a known  n x p model  matrix,  Z is 
a known  n x q design  matrix,  (3  is  a p x 1 vector  of  parameters,  u is  a q x 1 vector 
of  random  variables,  and  e is  an  n x 1 vector  of  residual  errors.  We  will  also  assume 
that  X is  of  full  column  rank  so  that  X'X  is  invertible. 

A Bayesian  version  of  this  model  may  be  expressed  as  a conditionally  independent 
hierarchical  model 

Y |/3,u,R~  Nnpr/S  + ZiqR-1) 

/3~  NP(/30,B-X)  u|D  ~ Ng(0,  D-1)  (3.30) 

with  as  of  yet  unspecified  priors  /(R)  and  /(D).  Here  (30  and  B^1  are  assumed  to  be 
known.  Then  the  posterior  density  of  (/3,  u,  R,  D)  given  the  data,  y , is  determined 
by 

i r(/J,  u.  R,  D|y)  oc  f(y\0,  u,  R)/(/3)/(u|D)/(R)/(D)  (3.31) 

where  / again  denotes  a generic  density.  The  only  two  requirements  we  have  for 
the  priors  on  R and  D is  that  the  resulting  posterior  (3.31)  be  proper  and  the  full 
conditionals  be  easily  sampled.  Even  if  we  use  proper  conjugate  priors,  the  integrals 
required  for  inference  through  this  posterior  are  not  typically  available  in  closed  form. 
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For  our  model  the  full  conditional  densities  required  for  the  block  Gibbs  sampler 
satisfy,  with  £ = (/3T , ur)T, 

ir(R|{,D,y)  = CRK)-1|R|1/2exp{-0^(y  - X0  - Zu)TR(y  - X/3  - Zu)}/(R) 
jr(D|£,R,y)  = CD(£)-1|D|1/2exp{-0.5u,'Du}/(D) 


where  Cr(£)  and  Cx>(£)  are  the  normalizing  constants.  That  is,  these  conditional 
densities  depend  on  the  eventual  choice  of  /( R)  and  /(D).  The  density  7t(£|R,  D,  y) 
is  ( p + g)-variate  Normal  with  mean  £0  and  covariance  matrix  £_1  where 

/ 


XTRX  + B XTRZ 
ZTRX  ZTRZ  + D 


and 


XTRy  + B/30 
ZTy 


\ 


Throughout  we  will  assume  that  the  block  Gibbs  sampler  evolves  according  to  the 
following  sampling  scheme:  (D',R',£')  ->  (D,R,  £).  Then,  suppressing  the  depen- 
dence on  the  data,  the  transition  density  is  given  by 


k(  D,  R,  £|D',  R',  £')  = 7r(D|£>(R|£>(£|R,  D) 


Now  we  can  follow  the  basic  recipe  of  Mykland  et  al.  (1995)  for  calculating  minoriza- 
tion  conditions  for  Gibbs  samplers.  To  this  end,  fix  a point  £ and  sets  MR  and  Md 
so  that 


MD,R,£|D',R',£') 


7r(Dl£/)7r(R|£') 

7t(D|£)7t(R|£) 


7r(D|£)7r(R|£)7r(£|R,D) 


> 

inf 

[ inf 

_ReMa  tt(R  £)  _ 

_DeMD  7t(D|£). 

7r(D|£)7r(R|£)7r(£|R,  D)/[R  G MR]J[D  G MD] 
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Then  the  minorization  condition  will  follow  by  taking 


inf 


7T(R|0 


_R£Mr  tt(R|£)  _ _DeMD  tt(D|^)  . 


inf 


(3.32) 


and 


?(D,R,0  = C-17r(D|07r(R|07r(^|R!D)/(R  G MR)/(D  € MD) 


where 


cq  = j 7r(D||)7r(R|£)7r(£|R,  D)/(R  G MR)/(D  G MD)  dRdB  . 
When  R G Mr  and  D G MD  the  probability  of  regeneration  is  given  by 

IMf=l|D\HXD,H,e)=  f W *1  [ inf 


RCMr  7t(R|^)  J [dgmd  tt(D|£)J  tt(D|£>(R|0  ■ 

(3.33) 

So  all  we  have  to  do  is  calculate  the  infima  in  (3.32)  and  plug  into  (3.33).  To  this 
end  let  a^j  < a,2ij  for  i = 1, . . . , q and  j = 1, . . . , q be  constants  and  define 

MD  — {IVIgxg  ■ ®1  ij  Wly  ®2 ij}  ■ 


Then 


inf 


dgmd  tt(D|0 


7r(D|£')  _ Cp($)  exp{— 0.5u'TDu'} 

Cd(0  dgMd  exp{— 0.5urDu} 

= ^DsLexp{-0'5[u'TDu'-flTDa)} 

Cp(l)  Im 

Cj}{£')  d6Md 


inf  exp  ^ —0.5 


ui ujd%j  uiUjdij 


L i j 


* 3 


inf  exp  < -0.5  Y^  Y^ («'?/•  - UiUj)di 
CD(^  ) DSMd  [ ij 


> 


exp  | -0.5  ^2  _ 

CD(()  , , 

c^9(u'u) 


(3.34) 
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9ij 


where 

( 

(luj  if  u[u'-  — inuj  < 0 
a,2ij  if  u'ju'j  — UiUj  > 0 

Let  v'  = y — X/3'  — Zu'  and  v = y — X/3  — Zu.  Also,  let  buj  < b2ij  for  i = 1, . 
and  j = 1, . . . , n be  constants  and  define 


.,n 


inf 

RgMr,  <r(R|C) 


where 


Hr  — {hdnxn  : buj  ^ TYlij  < b2ij } • 


Cr(Q  inf  exp{— 0.5(y  - X0  - Zu'fR (y  - X/3'  - Zu')} 
Ck(£)  RgMr  exp{— 0.5(y  - X/3  - Zu)TR(y  -X/3-  Zu)} 
Cr(£)  .nf  exp{— 0.5v/TRv'} 


Cr(£')  Re  Mr  exp{— 0.5vrRv} 

> exp  | -0.5  Yl(v'iv'j  ~ wAK 

■ 


hij 


buj  if  v'w'  - ViVj  < 0 
b2a  if  v'tv'j  - v^j  > 0 


The  probability  of  regeneration  is  given  by 


(3.35) 


Pr(<S  = 1|D',  R',£',  D,  R,  £)  = y(u',  u)h(v',  v)  exp  {— 0.5(uTDu  — u,rDu)}  x 

exp  {-0.5(vrRv  — v^Rv)}  . 


3.6  Data  Augmentation  Schemes 

Data  augmentation  refers  to  the  very  general  technique  of  introducing  latent  vari- 
ables into  a setting  in  order  to  improve  computational  efficiency.  For  example,  sup- 
pose we  have  a target  density,  n,  that  we  would  like  to  sample  using  Gibbs  but  each 
full  conditional  is  not  explicit.  Then  we  may  be  able  to  find  another  density,  g , such 
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that 

J g(x,  z)dz  — tt(x).  (3.36) 

In  this  case,  g is  said  to  be  the  completion  of  7r  and  we  can  now  treat  g as  the  new 
target  density.  That  is,  we  can  construct  a Gibbs  sampler  from  the  full  conditionals 
of  g.  Not  having  explicit  conditionals  is  just  one  reason  to  use  data  augmentation. 
However,  (3.36)  does  illustrate  the  general  approach  of  finding  a “larger”  model,  g, 
that  has  the  target,  7r,  as  a marginal. 

In  the  rest  of  this  section  we  first  establish  geometric  convergence  for  a simple 
model  taken  from  van  Dyk  and  Meng  (2001).  Then  we  consider  slice  samplers  in 
some  detail. 


3.6.1  Marginal  Augmentation 

Let  t(u , p,  A)  denote  a Student’s  t distribution  with  v degrees  of  freedom,  location 
parameter  p and  scale  parameter  A.  Assume  v is  known  and  suppose  for  * = 1, . . . , n 
that  the  Yi  are  iid  t(u,  p,  A).  For  now,  we  also  assume  only  that  n(p,  A)  is  a prior  such 
that  the  posterior  for  (p,  A),  i.e., 


f(y\p,  A)tt(p,A) 
m(y) 


is  proper.  Here  y — (y1; . . . , yn)T  and  m(y)  is  the  marginal  density  of  y.  Our  goal  is 
to  sample  from  7r (p,  A|y).  van  Dyk  and  Meng  (2001)  show  how  to  embed  this  model 
into  a larger  one  that  yields  a Gibbs  sampler  with  desirable  mixing  properties.  Let 
a > 0 and  consider  the  two-stage  hierarchy 


Y\q  ~ N(„,24) 

, ~ Gamma  (£,£). 


Then  the  marginal  distribution  of  Y does  not  depend  upon  a,  more  specifically,  Y ~ 
t(u , p , A).  This  fact  will  allow  us  to  embed  our  original  model  into  the  following  more 
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complicated  one.  Now  suppose  (Fj,  g*)  are  independent  random  pairs  and  consider 
the  following  hierarchical  model 


Yi\qhn,  \,a  ~ 
qi\a  ~ 


Gamma 


(//,  A,  a)  ~ 7r(/x,  A)7r(a) 


(3.37) 


where  we  now  take  7r(/r,  A)  oc  A 1 and  let  7r(o;)  a a 1.  Let  q = (qi,  • • • , qn)T  so  that 
the  posterior  for  (3.37)  is  determined  by 


Tr(qiH,\,a\y)  oc 


n 

n/tete.M,A,a)/to|a) 

_i=l 


tt(/x,  A)7r(a)  . 


If  we  first  integrate  the  posterior  with  respect  to  q,  then  the  only  remaining  a’s  are 
from  7r(a).  Thus,  it  is  easy  to  see  that  this  posterior  is  improper.  Of  course,  this  means 
that  the  Gibbs  chain  {((<fy),  (fij,  Xj,  aj)T)  : j = 0,1,...}  based  upon  this  posterior 
will  not  be  positive  recurrent  which  is  contrary  to  a critical  piece  of  our  assumption 
(.A).  However,  it  is  important  to  keep  in  mind  that  we  are  interested  in  sampling 
from  7r(/x,  A| y)  and  not  the  full  posterior.  In  fact,  it  is  fantastic  that  implementing 
the  Gibbs  sampler  based  on  this  model  yields  a subchain  (/if,  A j)  which  is  a positive 
recurrent  Markov  chain.  (For  more  on  the  relationships  between  Markov  chains  and 
their  subchains  consult  Hobert  (2001b).) 

Meng  and  van  Dyk  (1999)  and  Hobert  (2001a)  both  describe  in  detail  how  to 
use  Gibbs  to  sample  from  7r(/x,  A|y).  Because  we  will  make  use  of  this  scheme  it  is 
repeated  here.  Observe  that  even  though  n (q,  /i,  A,  a\y)  is  improper,  as  long  as  n > 2, 


/ 7r(<7,  /i,  A,  a\y)dq  < oo  and 


J 7r(q,  /x,  A,  a\y)d/j,  dX  da  < oo  . 


Thus  we  can  define  legitimate  conditional  densities  n(q\fi,  A,  a,  y)  and  n(y,  A,  a\q,  y). 
(Hobert  and  Casella  (1998)  discuss  the  dangers  usually  encountered  when  one  imple- 
ments Gibbs  with  full  conditionals  that  have  no  corresponding  joint  density.)  In  this 
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case,  the  transition  density  for  the  subchain  of  interest  is  given  by 


k(n, X\y',  X')  = J j 7r(/i,  \,a\q,y)n(q\y',X',a',y)dqda  . (3.38) 


Note  that  a closed  form  for  k(y,  A')  is  unavailable.  Also,  because  ( yj,Xj ) is  a 
Markov  chain  (Meng  and  van  Dyk  1999,  p.  310)  the  right-hand  side  is  not  a func- 
tion of  a'.  However,  it  is  easy  to  simulate  from  k by  drawing  from  n(y,  X,a\q,y) 
and  Tr(q\y',X',a',y).  Let  a!  be  arbitrary.  Then  given  (//,  A')  we  draw  q from 
n(q\y',  A',  a',  y)  and  subsequently  draw  (y,  A,  a)  from  n(y,  A,  a\q,  y)  and  keep  (y,  A). 
We  can  draw  from  ir(y,  A,  a\q,  y)  by  sequentially  drawing  from  the  following  condi- 
tionals 


a\q  ~ 
A|a,  q ~ 

y\X:a,q  ~ 


f) 


IG 


( n - 1 q.v(q,y)\ 

\ 2 ’ 2 a J 


N (e(q,y),  ^ 


where  q.  = Yn=\  ft  and 

1 n i n 

e(q,y)  = -J^i/ift  and  v(q,  y)  = - ^ - e(q,  y)f . 

Q'  i=  1 ^ i= 1 

Later  we  will  use  the  fact  that  e(q,y)  and  v(q,y)  are  the  expectation  and  vari- 
ance of  a discrete  random  variable  taking  the  values  t/i,  y2, . . . , yn  with  probabilities 
qi/q.,  qz/q-,  • • • , qn/q • • Finally,  we  note  that  each  of  the  components  of  q are  condi- 
tionally independent  and  for  i = 1, . . . , n 


qi\y,  A,a~  Gamma  ^ 


(Vi  ~ /*)S 
A 


so  draws  from  n(q\y',  A',  a',  y)  are  easily  obtained. 

Hobert  (2001a)  established  geometric  ergodicity  for  the  subchain  ( Hj,Xj ) when 
n = 2.  On  the  other  hand,  most  of  his  argument  applies  to  the  more  general  case 
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of  a sample  of  size  n.  We  now  take  up  his  argument  and  establish  a geometric  rate 
of  convergence  for  the  general  case.  However,  in  the  interest  of  completeness  we  will 
begin  by  retracing  some  of  Hobert’s  (2001a)  steps. 

We  will  establish  the  required  drift  and  minorization  conditions  with  the  following 
energy  function 

V(n,  A)  = j - o)2- 

2—1 

Hobert  (2001a)  uses  Lemma  3.3.1  to  establish  the  required  minorization  condition. 
Theorem  3.6.1.  (Hobert  2001a)  The  Markov  transition  density  k(/i,  A| g! , A')  satisfies 
the  following  minorization  condition 

k(g,  X\g',  A')  > eh(g,  A)  V (//,  A')  G S = {(//,  A)  : V(g,  A)  < d} 


where  d is  any  positive  constant  and  h(g,  A)  is  a density  on  E x 1R+  given  by 

g(Qi) 


h{g,  A)  = //  tt(//,  X,a\q,y) 


n 


it  Jo  9(q)dq 

and  e = (/0°°  g(q)dq)n.  The  function  g(-)  is  given  by 


dq  da 


g{q)  = < 


Gamma  if  0 < q < q* 

Gamma  if  q > q* 


where 


* 1 1 , d 
q = — j—  log  l + 


d 


v 


Now  we  extend  Hobert’s  (2001a)  argument  to  establish  the  drift  condition. 
Proposition  3.6.1.  Let  V(y,  A)  = \ Ya=  i(Vi  ~ h)2-  If  v > 1 and  nv  > 2 then 

~2MT(n  - l)(v  + 1) 


E [V(n,  X)\y' , A']  < 


( nv  — 2 ){v  — 1) 


V(y',  A')  + b 


(3.39) 


b = 


nv  v(n  — 1)  f MTn( v + T) 


nv 


+ 


nv  — 2 \ 


2T 


i<j 


(hi  - ViY 


where 
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M = max 


(yi  - Vn)2  ’ (2/1  - V2)2  ’ (2/2  - 2/3)2  ’ ' ” ’ (2/n-l  “ 2/n)2 


and 


^ = E 


»=i  Lj=i 


YliVi  - 2/j)2  + 2 X K&  ~ 2/j)(2/»  - 2/fc)  | 


j<k 


Remark.  Of  course,  we  can  use  (3.39)  with  Theorem  3.6.1  to  claim  that  the 
marginal  chain,  (/ij,  \j),  is  geometrically  ergodic  for  all  u such  that 

2MT(n  - \){u  + 1) 

( nu  — 2)(u  — 1) 

Proof.  Hobert  (2001a)  shows  that  as  long  as  nu  > 2 


< 1 . 


£r=it2/i -e(<?,2/)]2, , v 

l#,’A 


1=1 


2/i 


Note  that 

X^i  - e(9,  y)}2  = X 

t=i 

= -X 

q2h 

n 

= ?E 

i=l 

= ^E 


1 n r 
- X VM 


ViQ-  - X yiqi 
i= 1 

n 

Yqi(yi  ~ % ) 

.j=i 


n 2 


9.  7 


i=i  Lj=i 


XI  qUyi  ~ yo)2  + 2Y  qiqk(yi  - 2/j)  (2/i  - yk) 


j<k 


n 

< E 

i=l 

= T 


Xbi  - 2/j)2  + 2 X l(2/i  - 2/j) (2/i  - 2/fc) I 

j=l  j<k 


and  hence 


E[V(fi,  A)|/z',A']< 


^ + ^(n~  1)T£. 

nu  — 2 nu  — 2 


(3.40) 
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Now  recall  our  observation  that  v(q:  y ) is  the  variance  of  a discrete  random  variable 
taking  the  values  y\,  y2, . . . , yn  with  probabilities  qi/q.,  q2lq.,  ■ ■ ■ , qn/q • • Then 


2 5^9<9j(y<-!/j)2 

i<j 


= EE  ft  ft  (ft  - ft)2 

i= 1 j— 1 
n n 

= EE  ~ 2ViVi  + J/j) 
= EEw2+EE« 


i=l  j=l 
n 


i= 1 j=l 


= 29.  X ftft*  - 2 I XI 9ift 

i=i  \»=i 

= 292u(q,y) 


n n 


2 X qiVi  X qiyi 

i=l  j= 1 


so  that 


ft 


v(9,y)  £i<jftft(ft  ~ ft)2 

(E?=i  + 2 Ei<j  m) 


Y,i<j  QiQj(yi  - ft) 


02 


< 


'n-1 

E^ 


ft 


Qn 


1 1 

_7/^2+  l^(y._y.y 


“ ft+1  (ft  - ft+l)2  ft  (ft*  - 2/l) 

h2X7ETEE 


' 91  tr  ®+i 


t<j 


(ft  - ft)2 


no 


and  because  the  <?;  are  conditionally  independent 


E 


-+E—  )i/*',v 

Ti 

= A')i?(gfl|/i',  a')  + E «(«!/>  A') 


n— 1 


< 


I/  + 1 

I/  - 1 

I/  + 1 
v - 1 
1/  + 1 


i=l 
n— 1 


(2/1  ~ m')2  + AV  ^ (y»+i  ~ aO2  + 

(: Vn  - m')2  + AV  " (y<  - ^')2  + ^ 


i=l 


AV 


-VV.A')  + 


n{v  + 1) 


i/(u  — 1)  vr~  ’ ' ' v — 1 

Combining  this  with  (3.40)  yields  the  result. 


□ 


Now  that  we  have  conditions  which  will  guarantee  a geometric  rate  of  conver- 
gence we  will  discuss  how  to  implement  regenerative  simulation.  To  begin  fix  a point 
(y, A, a)  and  let  D = [du,di2]  x [^21,^22]  x •••  x [d„i,dn2]  where  all  of  the  d{j  are 
known  positive  constants.  Now  consider  the  transition  density 


k{fi,  A|/t',  A')  = j j ir(fi,\a\q,y)n(q\li',\',a',y)dqdc 

= J f ?’  V\  A,  <>\q,  yMq\y',  V,  a',  y)dqda 


> inf 


n(q\jx,  A ,d,y) 
n{q\il,\,a,y) 
n{q\n',\',a',y) 
qeD  n(q\ji,\,a,y)  . 

/ / /d(9)  7r(/x,  A,  a|g,  2/)tt  (qrj/x,  A,  a,  y)dqda 
®(/A  A')/(a4,  A) 


f(n,X)  = c 1 J J lD{q)n{vA,®\q,y)n{q\y,  ~\a,y)dqda 


with 


Ill 


where  c is  the  normalizing  constant.  Also, 

n(q\n',\'ia',y) 


s{p! , A')  = c inf 


qeD  n(q\fl,X,a,y) 


II  w— ^lexPi-L^ 


J=T  x'(yi  ~ vY 


»=i 


(vi  - /-02  (yi  - PY 


2A' 


2A 


where  for  i = 1 , . . . , n 


9i—\ 


d.,  if  (^')2  < ilhz& 

a,i  u 2A,  ^ 2A 


(yi-n')2  ^ (y«-£)2 
2A'  ^ 2A 


di 2 if  vy‘ow  ; > 


Then  the  probability  of  regeneration  is  given  by 

p / r -i  | / w „ ^ s(//,  A')/^,  A) 

Pr(5  — l|/x , A , /i,  A)  k^x^^x')  ' 

Since  k(p,  X\p',  A')  and  /(//,  A)  are  not  available  in  closed  form  it  is  necessary  to 
evaluate  these  quantities  numerically.  Moreover,  this  numerical  integration  would 
have  to  be  performed  every  time  the  chain  entered  the  set  D.  Thus,  while  using 
marginal  augmentation  may  result  in  a sampler  that  enjoys  nice  mixing  properties, 
it  appears  that  calculating  valid  Monte  Carlo  standard  errors  may  be  difficult. 


3.6.2  Slice  Samplers 

Suppose  that  the  target,  7 r(x)  : Rd  — > R+,  d > 1 is  a density  such  that  7r(a:)  a 
q(x)l(x)  where  q and  l are  positive  functions.  We  can  introduce  an  auxiliary  variable 
u such  that  the  joint  density  of  x and  00  is  given  by 

n(x,u)  ot  q(x)I[0  < uj  < l(x)\  (3-41) 

where  /(•)  is  the  usual  indicator  function.  Clearly,  the  marginal  density  of  x is  7 r(x). 
Now  the  slice  sampler  (Neal  2000)  is  just  the  usual  Gibbs  sampler  applied  to  the  joint 
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density  7t(x,oj).  Thus,  the  Markov  transition  density  is  given  by 


k(cj,  x\u)',  x')  = 7r(a;|a;')7r(x|a;) 


where  uj\x  ~ Uniform  (0,/(x))  and  7r(x|a;)  oc  q(x)I[l(x)  > ui\. 

There  are  several  existing  results  concerning  the  convergence  rates  of  slice  sam- 


a theorem  that  gives  us  the  tools  to  establish  geometric  ergodicity  for  many  slice 
samplers. 

Theorem  3.6.2.  (Roberts  and  Rosenthal  1999a)  Let  G(ui)  — uj^+1Q'(uj)  where  a > 1 
and 


Then  if  n is  bounded  and  G is  non-increasing,  at  least  on  an  open  set  containing  0, 
the  corresponding  slice  sampler  is  geometrically  ergodic. 

For  slice  samplers  satisfying  the  conditions  of  Theorem  3.6.2  Roberts  and  Rosen- 
thal (1999a)  also  use  the  results  of  Roberts  and  Tweedie  (1999)  to  show  that,  under  a 
mild  condition  on  the  starting  value,  the  total  variation  distance  to  stationarity  will 
be  less  than  0.01  after  at  most  530  iterations.  We  now  consider  two  examples  from 
Damien,  Wakefield  and  Walker  (1999). 

Example  3.6.1  This  is  example  3 of  Damien  et  al.  (1999).  Suppose  0 < a < 1 and 
let  l(x)  = axI(x  > 0)  and  q be  a density  of  known  form.  We  also  assume  that  q is 
such  that  7 r is  bounded.  Now  we  need  to  show  that  G is  non-increasing  for  sufficiently 
small  uj.  To  this  end,  suppose  u < a and  note  that 


piers.  We  present  two  of  these  and  apply  them  to  two  toy  examples.  We  begin  with 


q(x)dx. 


log  a 


q(x)dx  — 1 — Fq  (log  uj  / log  q) 
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where  Fq  is  the  distribution  function  associated  with  the  density  q.  Then 


Q\uj)  = q(\ogu)/  log  a) 

id  log  a 


so  that 


Next  consider 


id 


l/a 


G(u)  = q(\ogid/  log  a) 

logo 


= - 


CJa 


I-i  r 


q{\oguj / log  o)  q'  (log  id/  log  a) 


- + 


log  a [ a logo 

so  that  G'(id)  < 0,  and  hence  G is  non-increasing,  when 

?(loga;/  log  a)  < -a 


q'  (log  id/  log  a)  log  a 

Example  3.6.2  This  is  example  4 of  Damien  et  al.  (1999).  Suppose  that  we  observe 
t\x  ~ Poisson  ( ex ) where  X ~ N(0, 1).  Then  the  posterior  is  characterized  by 

n{x\ t)  oc  e-eXeTXe~x212  oc  e~e* e~^x~T)\ 

Note  that  tt  is  bounded.  Now  let  q(x)  = e~^x~T^  and  l(x)  = e~e  . Then  Q(id ) = 
\/27rF9(log  log(l /u))  where  Fq  denotes  the  cdf  of  a N(r,  1)  random  variable.  Thus 

Q\u)  = ^(bglogd/iO) 
a;  log  a; 


and 

so  that 
G\u ) = \/27t 
Note  that 


G(id)  = V2nj- — (/(log  log(l/u;)) 
log  (d 


1/a 


-1 


logo; 


g(loglog(l/a;))  + 1_  (g»(i0gi0g(i/w))  _ 9(l0glog(l/w))) 

a logo; 


g'(loglog(l/u;))  - g(loglog(l/a;))  = 


_ r - (loglog(l/a;)  + 1)  „-o.5(1oK1oK(i/^-t)2 


V2tt 
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which  will  be  non-positive  if  u < exp{— exp{r  — 1}}.  For  such  u this  implies  that 
logo;  < 0 so  that 


~~ <f(loglog(l/u))  - g(loglog(l/w))  > 0 . 
logo; 


Hence  G'{u)  < 0 for  u < exp{-  exp{r  - 1}}.  Thus,  the  corresponding  slice  sampler 
is  geometrically  ergodic. 

Theorem  3.6.3.  (Mira  and  Tierney  2001)  The  slice  sampler  for  7 r a q(x)l(x)  is 
uniformly  ergodic  if  l is  a bounded,  integrable  function  and  the  rate  of  convergence, 
t,  to  stationarity  is  such  that 


t<l-h 


sup  l(x) 

xEX 


-1 


where  h — fxq(x)l(x)dx. 


Example  3.6.1  continued.  When  0 < a < 1 it  is  clear  that  l{x)  — ax  is  bounded 
and  integrable.  Hence  the  corresponding  slice  sampler  is  uniformly  ergodic  by  Theo- 
rem 3.6.3.  Moreover,  if  we  let  q(x)  = e~xI(x  > 0)  we  find  that  h = 1/(1  — log  a)  and 
hence  the  total  variation  distance  to  stationarity  after  n iterations  is  bounded  above 

by 

f log  a \n 
VI -log a) 

If  a — 0.01  then  24  iterations  are  required  to  get  the  total  variation  distance  below 
0.01,  but  if  a = 0.5  then  only  5 iterations  are  required  while  if  a = 0.99  then  we  only 
need  1 iteration  to  get  within  1%  of  stationarity! 


3.7  Minorization  for  Metropolis-Hastings  Samplers 
Much  of  this  section  is  taken  from  Mykland  et  al.  (1995).  Let  7r(aj)  be  the  target 
density.  The  Markov  transition  density  associated  with  a Metropolis-Hastings  chain, 
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having  invariant  density  n,  is  given  by 


k(x\x')  = a(x',  x)q( x\x')  + [1  — r(x')\8x'(x) 


(3.42) 


where 

ft  ^ • f Tr{x)q{x'\x)  \ 

a[x  , x)  = min  < , 1 > , 

[7r(a:')g(^c|a:,)  J 

r(x')  = f a(x',  x)q(x\x')dx  and  6X'  denotes  the  Dirac  mass  in  x',  that  is,  Sx> (x)  = 1 
if  x'  = x and  0 otherwise  (Chib  and  Greenberg  1995,  Mykland  et  al.  1995,  Tierney 
1998).  As  Chib  and  Greenberg  (1995)  show,  Metropolis-Hastings  algorithms  are 
constructed  so  as  to  satisfy  the  detailed  balance  condition 

k(x\x')Tr(x)  — k(x'\x)'K(x')  . 

A minorization  condition  will  follow  if  we  can  find  an  s > 0 and  a g > 0 such  that 

k{x\x')  > a(x' ,x)q(x\x')  > s(x')g(x). 

We  now  proceed  to  explain  a method  developed  by  Mykland  et  al.  (1995)  for  doing 
just  this.  To  this  end,  suppose  there  exists  a positive  function  /,  not  necessarily  a 
density,  satisfying 

f(x')q(x\x')  = f(x)q(x'\x).  (3.43) 

Let  w(x)  = 7r (x)/f(x).  Then  if  (3.43)  holds 

, , x . f w{x)  \ 
a(x  ,x)  = mm  < > ■ 

l w(x')  J 

Fix  any  c > 0 and  suppose  that  sQ  and  gQ  satisfy  q(x \x')  > sq(x')gq(x).  Now  if  we 
set 

{ c ) . . . f w(x) 

s(x)  = Sg(x)  min  1|  and  g(x)  = gq(x)  mm  j— — , 1 

then  because 


w(x)  1 1 ^ • f c 1 1 • / w(x)  i 

mm  <;  ; 1 > > mm  <{  , 1 } min  { — — , 1 


w(x') 


w(x')  ’ 
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it  is  clear  that  a minorization  condition  holds.  This  is  Theorem  3 of  Mykland  et 
al.  (1995).  These  authors  go  on  to  show  that  one  can  then  write  the  probability  of 
regenerating,  conditional  on  an  acceptance,  as 


Pr(<5  = l\x\  x) 


s(x')g{x) 
q(x \x')a(x',  x) 


(3.44) 


3.7.1  Independence  samplers 

Recall  that  an  independence  sampler  results  when  we  choose  q(x\x')  = q(x).  That 
is,  the  proposal  is  independent  of  the  current  state  of  the  chain.  Thus  (3.43)  holds 
with  / oc  q.  So  we  can  use  sq  — 1 and  set  gq(x)  = q{x).  Then  Mykland  et  al.  (1995) 
show  that  (3.44)  simplifies  to 


Pr(<5  = l\x',  x) 


max{c/«;(x),  c/w(x')} 
< ma  x{w(x)/c,w(x')/c} 

1 


if  w(x)  > c and  w(x')  > c 
if  w(x)  < c and  w(x')  < c 
otherwise. 


Thus  it  is  relatively  easy  to  implement  RS  when  using  an  independence  sampler.  For- 
tunately, as  we  mentioned  in  Chapter  2,  it  is  also  often  easy  to  establish  a geometric 
convergence  rate  for  independence  samplers.  Specifically,  recall  that  an  independence 
chain  is  uniformly  ergodic  if  for  every  k > 0 

i k(x ) 


q{x) 


< K. 


(3.45) 


We  will  use  this  result  in  subsection  3.8.2  to  establish  that  a particular  independence 
sampler  used  as  part  of  a Monte  Carlo  EM  algorithm  is  uniformly  ergodic. 


3.8  Markov  Chain  Monte  Carlo  EM  for  Random  Effects  Models 
Let  (Yj , uf)r, . . . , (Ytk,  vltk)t  be  K independent  random  vectors  and  let  A = 
(A^,  A^)T  be  a vector  of  unknown  parameters.  The  data  Y . . . , Y k are  modeled 
conditionally  on  the  unobservable  random  effects  vectors,  u!,...,u^.  Specifically, 
the  conditional  density  of  Y*  given  u,  is  denoted  by  /j(yj|uj;  Ai).  We  also  assume 
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that  the  marginal  density  of  u,  is  given  by  <7,(11*;  A2).  Observe  that  generalized  linear 
mixed  models  (Breslow  and  Clayton,  1993;  McCulloch,  1997,  McCulloch  and  Searle, 
2001)  fall  within  this  framework. 

Standard  frequentist  analysis  requires  maximizing,  with  respect  to  A,  the  likeli- 
hood of  the  observed  data,  that  is, 


Unfortunately,  the  integration  in  (3.46)  is  often  analytically  intractable  for  realistic 
models.  However,  if  we  think  of  the  random  effects  as  missing  data  then  the  EM 
algorithm  (Dempster  et  al.  1977)  provides  a method  for  approximating  these  integrals. 
Two  Markov  chain  Monte  Carlo  versions  of  the  EM  algorithm  that  are  often  useful 
in  this  context  are  introduced  in  the  next  subsection. 

3.8.1  Markov  Chain  Monte  Carlo  EM 

Each  iteration  of  the  EM  algorithm  consists  of  an  E-step  and  an  M-step.  The 
(r  + l)th  E-step  requires  calculation  of 


should  suggest  the  expectation  in  (3.47)  is  with  respect  to  the  distribution  of  u given 
y fixing  all  unknown  parameters  at  their  current  estimates. 

Unfortunately,  analytical  evaluation  of  (3.47)  is  impossible  in  most  settings.  One 
exception  is  the  Normal  theory  linear  mixed  model.  Thus  it  is  natural  to  turn  to 
Monte  Carlo  methods  in  order  to  obtain  good  approximations  of  the  desired  integral. 
Booth  and  Hobert  (1999)  considered  approximations  obtained  via  rejection  sampling 


(3.46) 


Q(A|AW)  = £{log/(y,u;  A)|y;  A(r)}. 


(3.47) 


where  /(y,u;A)  is  the  density  of  the  complete  data.  Then  the  new  value,  A^+1\ 
is  the  maximizer  of  Q(A|A(r)).  The  result  is  a sequence  {A(r)}  which  will  converge, 
under  regularity  conditions,  to  the  maximum  likelihood  estimate,  A.  As  the  notation 
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or  importance  sampling.  However,  their  approach  has  limited  applicability  when  the 
dimension  of  the  integrals  is  very  large.  This  has  led  to  the  use  of  Markov  chain  Monte 
Carlo  (see,  e.g.,  McCulloch,  1997;  Quintana,  Liu  and  del  Pino,  1999)  to  approximate 
the  integrals. 

In  order  to  use  MCMC  to  construct  an  estimate  of  the  expectation  in  (3.47)  we 
need  to  be  able  to  simulate  a Markov  chain  with  stationary  density  /(u|y,  A).  One 
way  of  doing  this  is  the  so-called  “one-at-a-time”  Metropolis-Hastings  algorithm. 
This  involves  cycling  through  each  of  the  components  of  u one  at  a time  and  up- 
dating each  using  a Metropolis-Hastings  sampler.  That  is,  at  each  step  we  use  a 
Metropolis-Hastings  chain  with  invariant  density  /(uj|yj,  A)  and  candidate  fi(iij|u-). 
McCulloch  (1997)  suggested  that  a natural  choice  for  the  candidate  density  h(iij|u() 
is  the  marginal  random  effects  density,  i.e.,  <7j(u;;A2).  This  is  one  way  to  use  an 
independence  sampler  at  each  step.  In  this  case,  the  probability  of  accepting  the 
proposal  Uj  given  the  current  state  u'  is 


On  the  other  hand,  updating  each  component  “one-at-a-time”  may  not  be  the 
most  efficient  way  to  construct  the  Metropolis  algorithm.  Instead,  we  may  wish  to 
perform  a block  update  by  drawing  a candidate  from 


That  is,  we  would  sample  each  of  the  marginal  random  effects  densities  simultane- 
ously and  independently  to  obtain  the  candidate  vector.  In  this  case  the  acceptance 
probability  is 


Thus,  either  algorithm  provides  a straightforward  approach  that  we  can  use  to 
overcome  the  intractable  E-step  by  replacing  it  with  a Monte  Carlo  approximation, 


K 


h{n)  = JJft(u*;  A2). 
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that  is, 

1 m 

Qm{  A|A(r))  = — ^log/(y,u{0;A)|y;A(r))  (3.48) 

171  1=1 

where  u^, . . . , is  the  sample  obtained  by  simulating  the  Markov  chain.  In  the 
next  subsection  we  investigate  the  convergence  rates  of  the  Markov  chains  induced 
by  these  two  algorithms. 


3.8.2  Convergence  Rates 

We  begin  with  the  block  independence  sampler.  This  chain  has  a transition  density 
given  by 

fc(u|u')  = h( u)a(u,  u')  + [1  — r(u')]flu'(u)  (3.49) 

where  r(u')  = / a(u,  u')h(u)du  and  8U>  denotes  the  Dirac  mass  in  u'.  It  is  easy  to 
show  that  this  chain  satisfies  assumption  (M)  with  invariant  density  /(u|y). 


Proposition  3.8.1.  Set  m^y^Xi)  = supu.  /i(y*|u*,  Ai).  If  each  m<(yi,  Ai)  < oo 
then  the  block  independence  sampler  given  in  (3.49)  is  uniformly  ergodic.  Let 

K 

\~[fi(yi\^z,  Ai)ft(ui|A2)  du. 

.1=1 

The  total  variation  distance  to  stationarity  after  n iterations  is  bounded  above  by 

m(y,  A) 


m 


(y,A)  = / 


1 - 


n£i  i), 


Proof.  This  requires  only  an  application  of  the  result  stated  in  (2.20).  Note  that 


K 


in^<(yilu<’Al)0i(u<lA2) 


i= 1 


is  playing  the  role  of  7 r while  h( u)  corresponds  to  q so  that  rx /q  is  given  by 


K 


K 


m(y,X)  1 n /i(y*lu-  Al)  ^ ^(y’ A)  'Urn+y^)  <00. 


i— 1 


2=1 


□ 
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McCulloch’s  sampler  is  a special  case  of  a particular  type  of  hybrid  MCMC 
sampler  which  we  will  introduce  and  then  specialize  to  his  algorithm.  Let  x — 
(xi,x2,  . . . , xp)T , where  each  xt  may  be  a vector,  and  suppose  we  want  to  sample  from 
nix).  Often  one  or  more  of  the  full  conditionals,  f(xi\x-i),  are  of  a non-standard 
form  and  difficult  to  sample  directly.  Thus  Gibbs  is  difficult  to  use.  An  alternative  is 
to  cycle  through  the  xt  and  update  each  using  a separate  Metropolis-Hastings  step 
constructed  to  have  the  relevant  full  conditional  as  its  invariant  density.  In  this  case, 
we  will  call  the  resulting  chain  a one-at-a-time  Metropolis-Hastings  sampler  (OMH). 
An  OMH  chain  has  a transition  density  given  by 


where  each  kt  is  a transition  density  for  a Metropolis-Hastings  chain  with  invariant 
density  f(xi\x-i).  In  Section  3.8.3  we  show  that  this  transition  density  for  an  OMH 
chain  has  the  desired  invariant  density,  i.e. , tt(x).  The  next  lemma  will  give  us  a way 
to  establish  that  an  OMH  chain  is  uniformly  ergodic. 

Lemma  3.8.1.  Let  ki,...,kp  be  the  component  Metropolis  transition  densities  of 
an  OMH  chain  satisfying  assumption  (A).  Suppose  that  there  are  positive  constants 
£i,...,ep  and  positive  functions  qi{xi),q2(xi,x2), . . . , qp{x)  such  that  for  each  i = 
1, ...  ,p  we  have  ki  > e then  the  OMH  chain  is  uniformly  ergodic.  Moreover,  if  we 
let 


then,  after  n iterations,  the  total  variation  distance  to  stationarity  is  bounded  above 
by  (1  — CqE  \£2  • • • £p)n  • 


koMH(x\x')  = ki(xi\x')k2(x2\xi,x'_l)  ‘ • kp(xp\x-p,x'p)  (3.50) 
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Proof.  We  will  show  that  a minorization  condition  holds  on  the  entire  space. 

koMH^x \x')  = ki(xi\x')k2(x2\xi:x'_l)---kp(xp\x-n,x'n ) 

> £i£2  ■ ■ ■ epqi(xi)q2{xi,  x2)  • • • qP(x) 

= CqElEl  • • • £pq{x) 


where  q(x)  = C~1qi(xi)q2{xi,x2)  ■ • -gp(x).  Observe  that  Cq<  oo  since 


dx 


1 = J k0MH{x\x')> 

> eX£i  ‘"EpJ 9l(®l )Q2(xi,X2)---qP{x)dx 
= Cq£\£2  ■ ■ • Ep 


and  we  assumed  that  e*  > 0 for  i = 1, . . . ,p. 


□ 


McCulloch’s  (1997)  algorithm  is  a particular  type  of  OMH  chain  in  which  each 
of  the  Metropolis-Hastings  samplers  are  independence  samplers.  Thus  the  transition 
density  for  McCulloch’s  (1997)  chain  is  given  by 

K 

fc(u|u')  = PI  Mui|u')  (3.51) 

2=1 

where  each  fcj(uj|u')  is  the  density  of  an  independence  sampler  with  invariant  density 
/(uj|yi,A)  and  candidate  pj(uj;A2).  Once  again  it  is  easy  to  see  that  this  chain 
satisfies  assumption  (yl)  and  we  can  use  the  work  in  Section  3.8.3  to  conclude  that 
the  invariant  density  is  /(u|y). 


Proposition  3.8.2.  Set  m^y^Xi)  — supu,  /i(yj|uj,  Ai).  If  each  rrii(yi,  Ai)  < oo 
then  McCulloch’s  OMH  chain  given  in  (3.51)  is  uniformly  ergodic.  Let 

K 


m(y 


,A)  = / 


SI/i(y*lu*,  Ai)(/i(ui|A2) 


,i=l 


du  . 
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The  total  variation  distance  to  stationarity  after  n iterations  is  bounded  above  by 

m(y,  A) 


1 - 


nf=i  mi(vv  ao, 


Proof.  Note  that  our  assumptions  imply  that  1 > / (y* |uj,  X^/m^y^  Ai)  and 

/(2/iKAi)  < /(yi|ui,Ax) 


fiViK^i)  rniiVi,*  i) 


Hence 


A;(uj|u')  > p(uj|A2)  min 


We  can  then  apply  the  lemma  to  obtain  the  result. 


. f1  /(y,K,Ai)|  /(y<|ui,A1)y(ui|A2) 

m I ’ /(yi|u^i)  J _ rn^y^X x) 


□ 


We  note  that  Booth  and  Hobert  (1999)  require  the  actual  value  of  Ax)  to 

implement  rejection  sampling  within  an  MCEM  algorithm  while  we  only  require  that 
mi(yi:  Ai)  be  finite. 


3.8.3  The  Invariant  Density  for  QMH 

We  assume  that  p = 2.  That  is,  we  suppose  n(x,  y)  is  a joint  (target)  density  on 
S = X x y C K2 . The  full  conditionals  are 


fx\r{x\y) 


irfoy) 

fxn(x,y)dx 


and  fY\x(x\y) 


n(x,y) 

Jyn(x,y)dy' 


(3.52) 


We  also  let  7 tx(x)  and  7 TY(y)  denote  the  marginal  densities  of  X and  Y,  respectively. 
Now  we  assume  that  neither  fx\y  nor  fy\x  are  easily  sampled  and  hence  we  will 
use  Metropolis  updates  for  each.  Throughout  this  subsection  we  will  update  in  the 
following  order:  (x':  y')  -4  (x,  y).  The  transition  density  of  this  chain  is 


k0MH{x,  y\x',  y')  = g{x \x',  y')k{y \x,  y') 


where  g and  k are  Metropolis-Hastings  transition  densities  with  invariant  densi- 
ties fx\y  and  fy\xi  respectively.  Recall  that  because  both  g and  k correspond  to 
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Metropolis-Hastings  samplers  they  were  constructed  so  as  to  satisfy  detailed  bal- 
ance. That  is,  for  fixed  y' , we  have 

g(x'\x,y')fx\Y{x\y')  = g(x\x',y')fx\Y(x'\y') 

and  for  fixed  x 

k(y'\x,y)fY\x(y\x)  = k(y\x,y')fY\x(y'\x). 

Now  we  show  that  koMH  has  the  correct  invariant  density. 


J k0MH(x,  yW,  y')d(x', ;/) 

= g(x\x',  y')k(y\x,  y')ir(x',  y’jix'dy' 

~ [ ! g(Ax  y')  /x|y  W)  - 

- jjxg{x\x,y) MxW 

k(y\x,y')fx\Y(x'\y')irY(y')dx'dy' 

= [ fx\Y(x\y')k(y\x,y')T:Y(y')  [ g(x'\x,y')dx'dy' 

Jy  J x 

= [ fx\Y{x\y')k{y\x,y')TrY(y')dy' 

Jy 

= [ k(y'\x , y)  fx\Y{x\y')nY(y')dy' 

Jy  jY\x{y'\x) 

= f k(y'\x,  y)  ^y|X^Jp-7r(x,  y')dy' 

= fY\x{y\x ) / k(y'\x,y)7rx(x)dy' 

Jy 

= fY\x(y\x)TTX{x) 

= n(x,y) 


3.9  Sampling  from  the  Posterior  of  Rosenthal’s  Model 
Rosenthal  (1996)  establishes  geometric  ergodicity  of  the  Gibbs  sampler  for  the 
following  conditionally  independent  hierarchical  model.  Suppose  for  i = 1 


that 


Yi\9i  ~ N(0*,a) 
Oi\Vi  A ~ N(/i,  A) 
A~IG(5,  c)  /(/it)  oc  1 
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where  a,  b and  c are  known  positive  constants.  The  full  conditionals  required  for 
implementation  of  the  Gibbs  sampler  are 

' /xa  + AF;  aX 


Oi |/x,A,y  ~ N 


o + A a + A 


/x|A,0,y  ~ N ( 9, 


A|0,y  ~ IG  [ 6 H — , c + ~y^(^  — 0)2  1 • 


i= 1 


In  the  rest  of  this  section  we  develop  a method  for  generating  iid  draws  from  the 
posterior  7r(0, /z,  A|y).  Consider  the  factorization 


?r(0,  h,  A|y)  = 7r(0|/x,  A,  y)n{^\\,  y)7r(A|y). 


If  we  can  simulate  from  each  of  the  three  densities  on  the  right-hand  side  then  there 
is  no  need  for  MCMC  (see  p.5  of  Besag  et  al.  1995).  Simulating  from  7r(0|/z,  A,y)  is 
easy  since  it  is  just  the  product  of  independent  univariate  Normals  (see  the  above  full 
conditionals).  We  now  proceed  to  show  how  to  sample  from  7r(/z|A,  y)  and  7r(A|y). 


125 


First  we  find  7r(/i,  A|y).  Let  c0  be  the  normalizing  constant  for  the  full  posterior. 
Then 

7r(//,A|y)  = [ n(0,n,\\y)dO 

Jrx 

K roo 

= ^w/wn e~^' 


5 


K 


nl  (>»-yi)2  r00  A+a/q 

—=e  2<A+a>  / e 2“*  ' ' A+°  ' 

27r\/aA  7-oo 


A" 


— r 1 (/*  Vt)2  t / UA 

co/(^)JJ 2<A+a>  \/27r 


r~  27r\/aA 


A “f“  CL 


K 


co/(a)  n 


(n-yi)2 

-g  2(A+a) 


f=T  vMA  + a) 

Now  for  7r(/i|A,y).  Note  that  7r(/i|A,y)  oc  tt(h,  A|y)  and  hence 

7r(/x|A,  y)  oc  e^_2(A+“)  ) a e~2(A+a jWm-s)  +a  ) ^ e-2(\+q)(/i 


where  s2  = J2(Vi  ~ v)2-  Thus  /z|A,y  ~ N(y,  (A  + a)/K). 

Finally,  since  7r(//,  A|y)  = tt(//|A,  y)7r(A|y)  we  have 

, ir(/i,  A|y)  _ c06  e-(i+ntel) 

" 7r(/x|A,y)  cA^pAw(J  + «)|K-1,'!  ' 

The  only  thing  that  remains  is  to  show  that  we  can  sample  from  7r(A|y).  Let  g( A) 
denote  an  IG(6  + 1,  c)  density  in  the  variable  A.  Then 

_ 7r(A|y)  _ c0b  1 _ 

A>S  9(A)  " A>S  (A  + „)<*-WVW«>  • 

Now  let  h( A)  = (A  + a)^-~K^2e~t2 /2(A+a)  and  note  that  fi  is  continuous  on  R+, 

lim  h( A)  = 0 and  fi(0)  = a^l~K^2e~s  ^ 2a . 

A— >oo 


dh 

dX 


K-  1 


2(A  + a) 


(A  + u) 


-(K+l)/2_s2/2(A+a) 


Also, 
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So  that  the  critical  point  is  A = ^ - a.  On  the  other  hand,  as  long  as  A > 0,  if 
A < A then  h'( A)  > 0 and  if  A > A then  h'( A)  < 0.  Thus  A is  a maximum  and 


sup 


7r(A|y) 


c0b 


a>o  #(A)  c\[K\J{ 2tt)k  1 \K  — 1 

On  the  other  hand,  if  A < 0 then 


sup 

A>0  #(A) 


This  shows  that  a rejection  sampler  with  a IG(6  + l,c)  candidate  could  be  used  to 
sample  from  7r(A|y) . Hence  it  is  possible  to  obtain  iid  draws  from  the  posterior. 


CHAPTER  4 

CONCLUDING  REMARKS 

Obviously,  the  MCMC  samplers  that  we  have  considered  here  are  relatively  simple 
compared  to  most  of  those  being  used  in  realistic  settings.  For  example,  if  one  (or 
more)  of  the  full  conditionals  in  a Gibbs  sampler  is  a nonstandard  density,  then 
establishing  drift  and  minorization  conditions  is  sure  to  be  much  harder  than  it  was 
for  the  Gibbs  samplers  studied  here.  Furthermore,  replacing  the  Gibbs  update  of  the 
nonstandard  conditional  by  a Metropolis-Hastings  update  will  often  only  complicate 
the  calculations. 

On  the  other  hand,  there  is  some  hope  that  one  could  establish  geometric  er- 
godicity  for  certain  classes  of  models  (i.e  target  distributions).  In  fact,  Mengersen 
and  Tweedie  (1996),  Roberts  and  Tweedie  (1996)  Jarner  and  Hansen  (2000)  have  al- 
ready made  substantial  progress  in  this  direction  for  Metropolis-Hastings  algorithms. 
Unfortunately,  their  results  hold  only  for  fairly  narrow  classes  of  target  distributions. 

While  results  such  as  these  are  indeed  useful  what  might  also  be  fruitful  is  to 
consider  MCMC  samplers  for  more  specific  classes  of  models  such  as  the  block  Gibbs 
samplers  in  Section  3.5  or  the  hybrid  samplers  used  by  Natarajan  and  Kass  (2000) 
to  sample  from  the  posteriors  of  their  Bayesian  generalized  linear  mixed  models.  In 
both  cases,  the  models  are  quite  useful  for  data  analysis  and  hence  there  would  be 
a substantial  benefit  to  understanding  the  convergence  behavior  of  the  associated 
Markov  chains.  Of  course,  given  the  work  in  Chapter  3,  this  would  appear  to  be  a 
challenging  exercise.  There  is  clearly  a great  deal  of  work  to  be  done  before  rigorously 
addressing  (Ql)  and  (Q2)  becomes  standard  practice  when  applying  MCMC. 

If  nothing  else,  the  results  in  Chapters  2 and  3 show  that  formally  addressing  (Ql) 
and  (Q2)  can  be  hard  work.  Moreover,  recall  that  when  using  classical  Monte  Carlo 
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methods  based  on  independent  samples,  (Ql)  is  moot  and  (Q2)  is  easy.  Thus,  before 
resorting  to  MCMC,  one  should  try  the  Monte  Carlo  methods  based  on  independent 
samples;  e.g.,  rejection  sampling  or  importance  sampling.  In  other  words,  MCMC 
should  not  be  the  default  approach  when  one  is  confronted  with  analytically  difficult 
integrals.  This  may  sound  obvious,  but  so  does  wearing  a seat-belt. 
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